{
 "cells": [
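  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Event Recommendation Algorithms\n",
    "    \n",
    "The Event Recommendation Engine Challenge is a Kaggle recommendation competition: given users' past event attendance, their social information, and the events they viewed and clicked in the app, predict whether a user will be interested in a given event.\n",
    "\n",
    "Following the principles of collaborative filtering, three approaches are used to predict a user's interest in an event:    \n",
    "1. User-based collaborative filtering (the user's own attributes and an analysis of the user's friends);     \n",
    "2. Event-based collaborative filtering (the event's own attributes and user-event association analysis);     \n",
    "3. Model-based collaborative filtering.   "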
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1.  Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from numpy.random import random \n",
    "\n",
    "import scipy.sparse as ss\n",
    "import scipy.io as sio\n",
    "import scipy.spatial.distance as ssd\n",
    "\n",
    "from collections import defaultdict\n",
    "from sklearn.preprocessing import normalize\n",
    "from utils import FeatureEng\n",
    "\n",
    "import itertools\n",
    "import pickle\n",
    "\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total users in the training and test sets: 3391\n",
      "Total events in the training and test sets: 13418\n"
     ]
    }
   ],
   "source": [
    "# Collect the users ('user') and events ('event') that appear in the training and test sets\n",
    "\n",
    "uniqueUsers = set()\n",
    "uniqueEvents = set() \n",
    "\n",
    "for filename in ['train.csv','test.csv']:\n",
    "    with open(filename) as f:\n",
    "        f.readline()                      # skip the header row\n",
    "        for line in f:\n",
    "            cols = line.strip().split(',')\n",
    "            uniqueUsers.add(cols[0])      # column 0: user id\n",
    "            uniqueEvents.add(cols[1])     # column 1: event id\n",
    "        \n",
    "n_uniqueUsers = len(uniqueUsers)\n",
    "n_uniqueEvents = len(uniqueEvents)\n",
    "print('Total users in the training and test sets:',n_uniqueUsers)\n",
    "print('Total events in the training and test sets:',n_uniqueEvents)\n",
    "\n",
    "userIndex = dict()\n",
    "eventIndex = dict()\n",
    "\n",
    "for i,u in enumerate(uniqueUsers):\n",
    "    userIndex[u] = i\n",
    "for i,e in enumerate(uniqueEvents):\n",
    "    eventIndex[e] = i\n",
    "\n",
    "pickle.dump(userIndex,open('userIndex.pkl','wb'))\n",
    "pickle.dump(eventIndex,open('eventIndex.pkl','wb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collect the users ('user') and events ('event') in the training and test sets, and build an index dictionary for each:         \n",
    "1) userIndex.pkl: the {user: index} dictionary, e.g. {'3389282421': 0, '1066372954': 1, '2010045207': 2, . . .}       \n",
    "2) eventIndex.pkl: the {event: index} dictionary, e.g. {'2223895704': 0, '3923328825': 1, '1184964740': 2, . . .}      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the user-event relations \n",
    "\n",
    "eventsForUser = defaultdict(set)   # user index -> set of event indices\n",
    "usersForEvent = defaultdict(set)   # event index -> set of user indices\n",
    "\n",
    "userEventScores = ss.dok_matrix((n_uniqueUsers, n_uniqueEvents))   # sparse user x event interest matrix\n",
    "\n",
    "with open('train.csv') as ftrain:\n",
    "    ftrain.readline()                  # skip the header row\n",
    "    for line in ftrain:\n",
    "        cols = line.strip().split(',')\n",
    "        i = userIndex[cols[0]]         # user index\n",
    "        j = eventIndex[cols[1]]        # event index\n",
    "\n",
    "        eventsForUser[i].add(j)\n",
    "        usersForEvent[j].add(i)\n",
    "\n",
    "        score = int(cols[4])           # the 'interested' column\n",
    "        userEventScores[i,j] = score\n",
    "\n",
    "pickle.dump(eventsForUser,open('eventsForUser.pkl','wb')) \n",
    "pickle.dump(usersForEvent,open('usersForEvent.pkl','wb'))\n",
    "\n",
    "sio.mmwrite('userEventScores',userEventScores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the user-event relations:      \n",
    "1) eventsForUser.pkl: the events each user attended, {user: {events}} -> {1437: {11779, 2022, 9447, 7081, 1501, 11806}, 2491: {9475, 12103, 9320, 8592, 4273, 10228, 10325}, . . .}         \n",
    "2) usersForEvent.pkl: the users attending each event, {event: {users}} -> {9447: {1433, 1437}, 1501: {1321, 868, 1437, 846}, . . .}     \n",
    "3) userEventScores.mtx: each user's interest in each event, {(user, event): interest} ->        \n",
    "(1437, 2022)\t1.0        \n",
    "(2491, 10325)\t1.0        \n",
    "(2491, 8592)\t1.0         \n",
    ". . .    \n",
    "userEventScores.getrow(i).todense() returns the interest of the user with index i in every event; for user 2491, for example: [[0. 0. 0. ... 0. 0. 0.]]     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect related users and related events \n",
    "\n",
    "uniqueUserPairs = set() \n",
    "uniqueEventPairs = set() \n",
    "\n",
    "for event in uniqueEvents:\n",
    "    i = eventIndex[event]         # event index\n",
    "    users = usersForEvent[i]      # users recorded for this event\n",
    "    if len(users)>2:              # note: an event with exactly two users contributes no pair\n",
    "        uniqueUserPairs.update(itertools.combinations(users,2))  \n",
    "\n",
    "for user in uniqueUsers:            \n",
    "    u = userIndex[user]           # user index\n",
    "    events = eventsForUser[u]     # events recorded for this user\n",
    "    if len(events)>2:             # note: a user with exactly two events contributes no pair\n",
    "        uniqueEventPairs.update(itertools.combinations(events,2))  \n",
    "        \n",
    "pickle.dump(uniqueUserPairs,open('uniqueUserPairs.pkl','wb'))\n",
    "pickle.dump(uniqueEventPairs,open('uniqueEventPairs.pkl','wb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collect related users and related events:       \n",
    "1) uniqueUserPairs.pkl: the pairwise combinations of all users attending the same event, i.e. related users: {(234, 3007), (2912, 1979), (2197, 533), (681, 1396), (1417, 3028), (3306, 3335), (2167, 3339), (849, 1982), . . .}       \n",
    "2) uniqueEventPairs.pkl: the pairwise combinations of all events attended by the same user, i.e. related events: {(4001, 12866), (2261, 2169), (10614, 4855), (10036, 3480), (612, 2088), (2758, 5259), (161, 5787), (6304, 12369), . . .}     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>locale</th>\n",
       "      <th>birthyear</th>\n",
       "      <th>gender</th>\n",
       "      <th>joinedAt</th>\n",
       "      <th>location</th>\n",
       "      <th>timezone</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3197468391</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1993</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-10-02T06:40:55.524Z</td>\n",
       "      <td>Medan  Indonesia</td>\n",
       "      <td>480.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3537982273</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1992</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-09-29T18:03:12.111Z</td>\n",
       "      <td>Medan  Indonesia</td>\n",
       "      <td>420.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>823183725</td>\n",
       "      <td>en_US</td>\n",
       "      <td>1975</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-10-06T03:14:07.149Z</td>\n",
       "      <td>Stratford  Ontario</td>\n",
       "      <td>-240.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1872223848</td>\n",
       "      <td>en_US</td>\n",
       "      <td>1991</td>\n",
       "      <td>female</td>\n",
       "      <td>2012-11-04T08:59:43.783Z</td>\n",
       "      <td>Tehran  Iran</td>\n",
       "      <td>210.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3429017717</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1995</td>\n",
       "      <td>female</td>\n",
       "      <td>2012-09-10T16:06:53.132Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>420.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id locale birthyear  gender                  joinedAt  \\\n",
       "0  3197468391  id_ID      1993    male  2012-10-02T06:40:55.524Z   \n",
       "1  3537982273  id_ID      1992    male  2012-09-29T18:03:12.111Z   \n",
       "2   823183725  en_US      1975    male  2012-10-06T03:14:07.149Z   \n",
       "3  1872223848  en_US      1991  female  2012-11-04T08:59:43.783Z   \n",
       "4  3429017717  id_ID      1995  female  2012-09-10T16:06:53.132Z   \n",
       "\n",
       "             location  timezone  \n",
       "0    Medan  Indonesia     480.0  \n",
       "1    Medan  Indonesia     420.0  \n",
       "2  Stratford  Ontario    -240.0  \n",
       "3        Tehran  Iran     210.0  \n",
       "4                 NaN     420.0  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read 'users.csv'; its columns are used to encode the user features\n",
    "\n",
    "users = pd.read_csv('users.csv')\n",
    "users.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode the features of every user_id that appears in the training or test set\n",
    "\n",
    "userIndex = pickle.load(open('userIndex.pkl','rb'))\n",
    "n_users = len(userIndex)\n",
    "\n",
    "FE = FeatureEng.FeatureEng()                    \n",
    "n_cols = users.shape[1]-1 \n",
    "userMatrix = ss.dok_matrix((n_users, n_cols))   \n",
    "\n",
    "cols = ['LocaleId', 'BirthYearInt', 'GenderId', 'JoinedYearMonth', 'CountryId', 'TimezoneInt'] \n",
    "for u in range(len(users)): \n",
    "    userId = str(users.loc[u,'user_id'])                                 \n",
    "    if userId in userIndex.keys():                                      \n",
    "        i = userIndex[userId]                                            \n",
    "        userMatrix[i,0] = FE.getLocaleId(users.loc[u,'locale'])          \n",
    "        userMatrix[i,1] = FE.getBirthYearInt(users.loc[u,'birthyear'])\n",
    "        userMatrix[i,2] = FE.getGenderId(users.loc[u,'gender'])\n",
    "        userMatrix[i,3] = FE.getJoinedYearMonth(users.loc[u,'joinedAt'])\n",
    "        userMatrix[i,4] = FE.getCountryId(users.loc[u,'location'])     \n",
    "        userMatrix[i,5] = FE.getTimezoneInt(users.loc[u,'timezone'])\n",
    "        \n",
    "userMatrix = normalize(userMatrix, norm=\"l2\", axis=0, copy=False)\n",
    "sio.mmwrite('userMatrix', userMatrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the similarity of related users \n",
    "\n",
    "userSimMatrix = ss.dok_matrix((n_users,n_users))\n",
    "for i in range(n_users):             \n",
    "    userSimMatrix[i,i] = 1.0            # diagonal: each user paired with itself\n",
    "\n",
    "uniqueUserPairs = pickle.load(open('uniqueUserPairs.pkl','rb'))   \n",
    "for u1,u2 in uniqueUserPairs: \n",
    "    i = u1 \n",
    "    j = u2 \n",
    "    if (i,j) not in userSimMatrix.keys():\n",
    "        # NOTE: ssd.correlation returns the correlation distance (1 - Pearson r), not r itself\n",
    "        usim = ssd.correlation(userMatrix.getrow(i).toarray().ravel(),userMatrix.getrow(j).toarray().ravel())  \n",
    "        userSimMatrix[i,j] = usim \n",
    "        userSimMatrix[j,i] = usim \n",
    "sio.mmwrite('userSimMatrix',userSimMatrix)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature encoding of the user attributes and similarity of related users:           \n",
    "1) userMatrix.mtx stores the encoded user features:    \n",
    "   (861, 0)\t0.021628856085564065     \n",
    "   (1356, 0)\t0.011957416372506963    \n",
    "    . . .    \n",
    "   (1785, 5)\t-0.01303142053748795    \n",
    "   (2847, 5)\t0.020850272859980718    \n",
    "    . . .       \n",
    "userMatrix.getrow(i).todense() returns the features of the user with index i; for user 861, for example: [[ 0.02162886  0.01728704  0.01196254  0.01770409  0.         -0.01303142]] \n",
    "\n",
    "2) userSimMatrix.mtx stores the similarity of related users (the related pairs are in uniqueUserPairs.pkl). The off-diagonal values come from ssd.correlation, which returns a correlation distance (1 - Pearson r), so they can exceed 1:            \n",
    "  (828, 1821)\t1.3388248003500465       \n",
    "  (147, 466)\t0.7114964412756368    \n",
    "  (466, 147)\t0.7114964412756368     \n",
    "   . . . "
   ]
  },
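  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on reading these values: `scipy.spatial.distance.correlation` returns the correlation *distance* `1 - r`, where `r` is the Pearson coefficient, so results lie between 0 and 2. The minimal sketch below uses made-up vectors `a`, `b` and `c` (illustrative only, not taken from the dataset) to show how such a distance could be turned back into a similarity, should that be preferred:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import scipy.spatial.distance as ssd\n",
    "\n",
    "# Hypothetical feature vectors, for illustration only\n",
    "a = np.array([1.0, 2.0, 3.0, 4.0])\n",
    "b = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly correlated with a\n",
    "c = np.array([4.0, 3.0, 2.0, 1.0])   # perfectly anti-correlated with a\n",
    "\n",
    "dist_ab = ssd.correlation(a, b)      # correlation distance, close to 0\n",
    "dist_ac = ssd.correlation(a, c)      # correlation distance, close to 2\n",
    "\n",
    "sim_ab = 1.0 - dist_ab               # Pearson r, close to +1\n",
    "sim_ac = 1.0 - dist_ac               # Pearson r, close to -1\n",
    "```\n",
    "\n",
    "Whether to store the distance or the similarity is a design choice; what matters is that the diagonal and the off-diagonal entries of a matrix follow the same convention."
   ]
  },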
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode the features of every event_id that appears in the training or test set\n",
    "\n",
    "eventIndex = pickle.load(open('eventIndex.pkl','rb'))   # fixed: this was assigned to userIndex\n",
    "n_events = len(eventIndex)\n",
    "\n",
    "FE = FeatureEng.FeatureEng()\n",
    "eventPropMatrix = ss.dok_matrix((n_events, 7))    # non-word-count features\n",
    "eventContMatrix = ss.dok_matrix((n_events, 101))  # word-count features (columns 9 to 109)\n",
    "\n",
    "import linecache                                        \n",
    "fevent = linecache.getlines(\"events.csv\") \n",
    "for line in fevent: \n",
    "    cols = line.strip().split(\",\")\n",
    "    eventId = str(cols[0])\n",
    "    if eventId in eventIndex.keys():                                    \n",
    "        i = eventIndex[eventId]                                         \n",
    "        eventPropMatrix[i,0] = FE.getJoinedYearMonth(cols[2])              # start time\n",
    "        eventPropMatrix[i,1] = FE.getFeatureHash(cols[3].encode('utf-8'))  # city\n",
    "        eventPropMatrix[i,2] = FE.getFeatureHash(cols[4].encode('utf-8'))  # state\n",
    "        eventPropMatrix[i,3] = FE.getFeatureHash(cols[5].encode('utf-8'))  # zip\n",
    "        eventPropMatrix[i,4] = FE.getFeatureHash(cols[6].encode('utf-8'))  # country\n",
    "        eventPropMatrix[i,5] = FE.getFloatValue(cols[7])                   # latitude\n",
    "        eventPropMatrix[i,6] = FE.getFloatValue(cols[8])                   # longitude\n",
    "        for j in range(9, 110):                                          \n",
    "            eventContMatrix[i,j-9] = float(cols[j])   # convert the count from str\n",
    "linecache.clearcache()\n",
    "\n",
    "eventPropMatrix = normalize(eventPropMatrix,norm='l2',axis=0,copy=False)   # fixed typo: eventPropMatrxi\n",
    "sio.mmwrite('eventPropMatrix',eventPropMatrix)\n",
    "\n",
    "eventContMatrix = normalize(eventContMatrix,norm='l2',axis=0,copy=False)\n",
    "sio.mmwrite('eventContMatrix', eventContMatrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\l\\Anaconda3\\lib\\site-packages\\scipy\\spatial\\distance.py:644: RuntimeWarning: invalid value encountered in double_scalars\n",
      "  dist = 1.0 - uv / np.sqrt(uu * vv)\n"
     ]
    }
   ],
   "source": [
    "# Compute the similarity of related events\n",
    "\n",
    "eventPropSim = ss.dok_matrix((n_events, n_events))   \n",
    "eventContSim = ss.dok_matrix((n_events, n_events))  \n",
    "\n",
    "uniqueEventPairs = pickle.load(open('uniqueEventPairs.pkl', 'rb')) \n",
    "\n",
    "for e1, e2 in uniqueEventPairs:\n",
    "    i = e1\n",
    "    j = e2  \n",
    "    if (i,j) not in eventPropSim.keys():       \n",
    "        # NOTE: ssd.correlation returns a distance (1 - Pearson r)\n",
    "        epsim = ssd.correlation(eventPropMatrix.getrow(i).toarray().ravel(),eventPropMatrix.getrow(j).toarray().ravel()) \n",
    "        eventPropSim[i,j] = epsim\n",
    "        eventPropSim[j,i] = epsim\n",
    "    if (i,j) not in eventContSim.keys():     \n",
    "        # NOTE: ssd.cosine returns a distance (1 - cosine similarity)\n",
    "        ecsim = ssd.cosine(eventContMatrix.getrow(i).toarray().ravel(),eventContMatrix.getrow(j).toarray().ravel())\n",
    "        eventContSim[i,j] = ecsim   # fixed: epsim was stored here\n",
    "        eventContSim[j,i] = ecsim\n",
    "\n",
    "sio.mmwrite('eventPropSim', eventPropSim)\n",
    "sio.mmwrite('eventContSim', eventContSim)"
   ]
  },
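  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature encoding of the events and similarity of related events:  \n",
    "1) eventPropMatrix.mtx stores the encoded non-word-count features (the sample values below appear to be unnormalized):  \n",
    "  (9020, 0)\t34.0   \n",
    "  (9020, 1)\t-1.0   \n",
    "  (9020, 2)\t-1.0   \n",
    "   . . .     \n",
    "2) eventContMatrix.mtx stores the encoded word-count features:  \n",
    "  (9020, 0)\t0.03457021969314784    \n",
    "  (5079, 0)\t0.03457021969314784     \n",
    "  (11304, 0)\t0.01728510984657392  \n",
    "  . . .    \n",
    "3) eventPropSim.mtx stores the similarity of related events on the non-word-count features (the related events are in uniqueEventPairs.pkl)     \n",
    "4) eventContSim.mtx stores the similarity of related events on the word-count features     "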
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aggregate the event interest of each user's friends\n",
    "\n",
    "numFriends = np.zeros((n_users))                    # number of friends per user\n",
    "userFriends = ss.dok_matrix((n_users, n_users))\n",
    "\n",
    "import linecache                                         \n",
    "fri = linecache.getlines(\"user_friends.csv\") \n",
    "for line in fri:                                     \n",
    "    cols = line.strip().split(\",\")\n",
    "    user = str(cols[0])                                 \n",
    "    if user in userIndex.keys():                        \n",
    "        friends = cols[1].split(\" \")                  # friend ids are space-separated\n",
    "        i = userIndex[user]                            \n",
    "        numFriends[i] = len(friends)                  \n",
    "        for friend in friends:                       \n",
    "            str_friend = str(friend)\n",
    "            if str_friend in userIndex.keys():        \n",
    "                j = userIndex[str_friend]             \n",
    "                newuser_score = userEventScores.getrow(j).todense()          # friend j's scores for all events\n",
    "                score = newuser_score.sum() / np.shape(newuser_score)[1]     # friend j's mean score over all events\n",
    "                userFriends[i,j] += score                                    \n",
    "                userFriends[j,i] += score  \n",
    "linecache.clearcache()\n",
    " \n",
    "sumNumFriends = numFriends.sum(axis=0)\n",
    "numFriends = numFriends / sumNumFriends\n",
    "sio.mmwrite('numFriends', np.matrix(numFriends))\n",
    "\n",
    "userFriends = normalize(userFriends, norm=\"l2\", axis=0, copy=False)\n",
    "sio.mmwrite('userFriends', userFriends)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aggregate the event interest of each user's friends:     \n",
    "1) numFriends.mtx stores each user's number of friends, divided by the total over all users     \n",
    "2) userFriends.mtx stores the interest in events of each user's friends:       \n",
    "  (3028, 19)\t1.0    \n",
    "  (3272, 221)\t1.0      \n",
    "   . . .     \n",
    "  (2044, 237)\t0.4472135954999579    \n",
    "  (1550, 237)\t0.8944271909999159     \n",
    "    . . .    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For each event, count the users who will and will not attend, and derive the event's popularity\n",
    "\n",
    "eventIndex = pickle.load(open(\"eventIndex.pkl\", 'rb'))\n",
    "n_events = len(eventIndex)\n",
    "eventPopularity = ss.dok_matrix((n_events,1))\n",
    "\n",
    "import linecache  \n",
    "event_file = linecache.getlines('event_attendees.csv') \n",
    "for line in event_file:  \n",
    "    cols = line.strip().split(',')\n",
    "    eventId = str(cols[0])\n",
    "    if eventId in eventIndex.keys():\n",
    "        i = eventIndex[eventId]                                                    \n",
    "        # popularity = (number of 'yes' users) - (number of 'no' users)\n",
    "        eventPopularity[i,0] = len(cols[1].split(' ')) - len(cols[4].split(' '))   \n",
    "\n",
    "eventPopularity = normalize(eventPopularity,norm='l1',axis=0,copy=False)\n",
    "sio.mmwrite('eventPopularity',eventPopularity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the event popularity:  \n",
    "1) eventPopularity.mtx stores the popularity of each event:\n",
    "[[ 0.        ]\n",
    " [ 0.        ]\n",
    " [ 0.        ]\n",
    " ...\n",
    " [ 0.        ]\n",
    " [-0.00145762]\n",
    " [ 0.        ]]"
   ]
  },
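  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " --------------------------------------------------------\n",
    "Summary 1:   \n",
    "The user ('user') and event ('event') attributes in the dataset are analysed and the main results are saved to files:        \n",
    "1) userIndex and eventIndex: the {user: index} and {event: index} dictionaries, e.g. {'2223895704': 0, '3923328825': 1, '1184964740': 2, . . .}\n",
    "\n",
    "2) eventsForUser and usersForEvent: the events each user attended, {user: {events}}, and the users attending each event, {event: {users}}, e.g. {1437: {11779, 2022, 9447, 7081, 1501, 11806}, 2491: {9475, 12103, 9320, 8592, 4273, 10228, 10325}, . . .}\n",
    "\n",
    "3) userEventScores: each user's interest in each event, {(user, event): interest}, e.g.  \n",
    "(1437, 2022) 1.0\n",
    "(2491, 10325) 1.0\n",
    "(2491, 8592) 1.0\n",
    ". . .\n",
    "userEventScores.getrow(i).todense() returns the interest of the user with index i in every event.\n",
    "\n",
    "4) uniqueUserPairs and uniqueEventPairs: the related users and related events. Related users are the pairwise combinations of all users attending the same event; related events are the pairwise combinations of all events attended by the same user, e.g. {(234, 3007), (2912, 1979), (2197, 533), (681, 1396),  . . .}   \n",
    "\n",
    "5) userMatrix is the feature encoding of the user attributes; userMatrix.getrow(i).todense() returns the features of the user with index i. userSimMatrix is the similarity of related users.\n",
    "\n",
    "6) eventPropMatrix is the non-word-count feature encoding of the events and eventContMatrix is the word-count feature encoding. eventPropSim is the similarity of related events on the non-word-count features and eventContSim is the similarity on the word-count features.\n",
    "\n",
    "7) Event interest of each user's friends: numFriends is each user's share of the total friend count; userFriends is the interest in events of each user's friends, e.g.  \n",
    "(3028, 19) 1.0\n",
    "(3272, 221) 1.0\n",
    ". . .\n",
    "(2044, 237) 0.4472135954999579\n",
    "(1550, 237) 0.8944271909999159\n",
    ". . . \n",
    "\n",
    "8) eventPopularity is the popularity of each event: [[ 0. ] [ 0. ] [ 0. ] ... [ 0. ] [-0.00145762] [ 0. ]]\n"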
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------------------------------------------------------\n",
    "# 2.  Loading the Data   \n",
    "Load the files saved by the analysis above; the loaded data feeds the collaborative filtering that follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the files saved by the data analysis above and initialise the data\n",
    "\n",
    "# User and event indices\n",
    "userIndex = pickle.load(open('userIndex.pkl','rb'))\n",
    "eventIndex = pickle.load(open('eventIndex.pkl','rb'))\n",
    "n_users = len(userIndex)\n",
    "n_items = len(eventIndex)\n",
    "\n",
    "# Events attended by each user and users attending each event\n",
    "itemsForUser = pickle.load(open('eventsForUser.pkl','rb'))\n",
    "usersForItem = pickle.load(open('usersForEvent.pkl','rb'))\n",
    "\n",
    "# User-event score matrix \n",
    "userEventScores = sio.mmread('userEventScores').todense()\n",
    "\n",
    "# Encoded user attributes and the user similarity computed from them\n",
    "userMatrix = sio.mmread('userMatrix').todense()\n",
    "userSimMatrix = sio.mmread('userSimMatrix').todense()\n",
    "\n",
    "# Encoded event attributes and the event similarity computed from them\n",
    "eventPropMatrix = sio.mmread('eventPropMatrix').todense()\n",
    "eventContMatrix = sio.mmread('eventContMatrix').todense()\n",
    "eventPropSim = sio.mmread('eventPropSim').todense() \n",
    "eventContSim = sio.mmread('eventContSim').todense() \n",
    "\n",
    "# Each user's friend count and the influence of each friend's event scores on that user\n",
    "numFriends = sio.mmread('numFriends')\n",
    "userFriends = sio.mmread('userFriends').todense()\n",
    "\n",
    "# Popularity of each event\n",
    "eventPopularity = sio.mmread('eventPopularity').todense()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "15398 3\n"
     ]
    }
   ],
   "source": [
    "# Read the training set into an array; it is used to compute the RMSE between predicted and actual scores\n",
    "\n",
    "dtrain = pd.read_csv('train.csv')\n",
    "data = dtrain[['user','event','interested']].values\n",
    "n,m = data.shape \n",
    "print(n,m)"
   ]
  },
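  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---------------------------------------------\n",
    "# 3.  User-Based Collaborative Filtering (the user's own attributes and friend analysis)\n",
    "User-based collaborative filtering computes the similarity between two users, user_1 and user_2, in two ways:  \n",
    "1) from the similarity of the two users' event scores;    \n",
    "2) from the users' own attributes.       \n",
    "The known scores (userEventScores, each user's interest in each event) are then combined with the similarity to compute a recommendation score (interest) for every user-event pair. "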
   ]
  },
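  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prediction step can be sketched on a toy example. This is a minimal illustration, assuming a made-up 3-user by 4-event matrix `scores` (not taken from the dataset) and using Pearson similarity between mean-centred rows; the names `scores`, `user_sim` and `pred` are hypothetical and chosen only for this sketch:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical 3-user x 4-event score matrix (1 = interested, 0 = no signal)\n",
    "scores = np.array([[1.0, 0.0, 1.0, 0.0],\n",
    "                   [1.0, 0.0, 0.0, 0.0],\n",
    "                   [0.0, 1.0, 0.0, 1.0]])\n",
    "\n",
    "mean_scores = scores.mean(axis=1, keepdims=True)   # per-user mean score\n",
    "diff_scores = scores - mean_scores                 # mean-centred scores\n",
    "\n",
    "# Pearson similarity between users = cosine of their mean-centred rows\n",
    "norms = np.linalg.norm(diff_scores, axis=1, keepdims=True)\n",
    "user_sim = (diff_scores @ diff_scores.T) / (norms @ norms.T)\n",
    "\n",
    "# Prediction: the user's own mean plus the similarity-weighted deviations of all users\n",
    "sim_sum = np.abs(user_sim).sum(axis=1, keepdims=True)\n",
    "pred = mean_scores + (user_sim @ diff_scores) / sim_sum\n",
    "```\n",
    "\n",
    "Dividing by the sum of absolute similarities keeps each prediction a weighted average of deviations, so users with many neighbours are not favoured over users with few."
   ]
  },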
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1) Predict each user's interest in events from the similarity of the users' event scores\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)         # per-user mean score\n",
    "diff_userScores = userEventScores - mean_userScores       # mean-centred scores\n",
    "diff_userScores = np.array(diff_userScores)\n",
    "\n",
    "userInterestSim = ss.dok_matrix((n_users,n_users))       \n",
    "for i in range(n_users):             \n",
    "    userInterestSim[i,i] = 1.0                           # each user is fully similar to itself\n",
    "    \n",
    "uniqueUserPairs = pickle.load(open('uniqueUserPairs.pkl','rb'))\n",
    "for u1,u2 in uniqueUserPairs:                           \n",
    "    i = u1 \n",
    "    j = u2 \n",
    "    if (i,j) not in userInterestSim.keys():                      \n",
    "        # Pearson correlation of the two users' mean-centred score rows\n",
    "        sim = np.sum(diff_userScores[i]*diff_userScores[j])\\\n",
    "        /np.sqrt(np.sum(np.power(diff_userScores[i],2)))/np.sqrt(np.sum(np.power(diff_userScores[j],2)))\n",
    "        userInterestSim[i,j] = sim                        \n",
    "        userInterestSim[j,i] = sim \n",
    "sio.mmwrite('userInterestSim',userInterestSim)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute each user's recommendation score for each event from the user similarity \n",
    "\n",
    "user_similarity = sio.mmread('userInterestSim').todense()\n",
    "user_similarity = np.mat(user_similarity)\n",
    "user_similarity_sum =  np.sum(np.abs(user_similarity),axis=1)\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)\n",
    "diff_userScores = userEventScores - mean_userScores \n",
    "\n",
    "userCFReco = mean_userScores + user_similarity * diff_userScores / user_similarity_sum\n",
    "sio.mmwrite('userCFReco',userCFReco)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE on the training set:  0.5179708945175653\n"
     ]
    }
   ],
   "source": [
    "# Use RMSE (root mean squared error) to compare the predicted scores with the actual scores; lower is better\n",
    "\n",
    "user_Reco = sio.mmread('userCFReco') \n",
    "user_Reco = np.mat(user_Reco)\n",
    "\n",
    "rmse_sum = 0.0    \n",
    "for i in range(n):\n",
    "    u_id = userIndex[str(data[i,0])]                              # user index\n",
    "    e_id = eventIndex[str(data[i,1])]                             # event index\n",
    "    rat = int(data[i,2])                                          # actual score\n",
    "    eui = rat - user_Reco[u_id,e_id]                              # prediction error\n",
    "    rmse_sum += eui**2                                           \n",
    "print('RMSE on the training set: ',np.sqrt(rmse_sum/n))  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2) Compute similarity from the users' own attributes, then the recommendation scores\n",
    "\n",
    "userProSim = ss.dok_matrix((n_users,n_users))                    \n",
    "for i in range(n_users):             \n",
    "    userProSim[i,i] = 1.0                                       \n",
    "    \n",
    "uniqueUserPairs = pickle.load(open('uniqueUserPairs.pkl','rb'))\n",
    "for u1,u2 in uniqueUserPairs:                                  \n",
    "    i = u1 \n",
    "    j = u2 \n",
    "    if (i,j) not in userProSim.keys():                 \n",
    "        # NOTE: ssd.correlation returns the correlation distance (1 - Pearson r), not r itself\n",
    "        sim = ssd.correlation(np.asarray(userMatrix[i,:]).ravel(),np.asarray(userMatrix[j,:]).ravel())    \n",
    "        userProSim[i,j] = sim                                                     \n",
    "        userProSim[j,i] = sim \n",
    "sio.mmwrite('userProSim',userProSim)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute each user's recommendation score for each event from the user similarity \n",
    "\n",
    "userPro_similarity = sio.mmread('userProSim').todense()\n",
    "userPro_similarity = np.mat(userPro_similarity)\n",
    "userPro_similarity_sum = np.sum(np.abs(userPro_similarity),axis=1)\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)\n",
    "diff_userScores = userEventScores - mean_userScores \n",
    "\n",
    "userReco = mean_userScores + userPro_similarity * diff_userScores / userPro_similarity_sum\n",
    "sio.mmwrite('userReco',userReco)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE on the training set:  0.51793963498875\n"
     ]
    }
   ],
   "source": [
    "# Use RMSE (root mean squared error) to compare the predicted scores with the actual scores; lower is better\n",
    "\n",
    "user_Reco = sio.mmread('userReco') \n",
    "user_Reco = np.mat(user_Reco)\n",
    "\n",
    "rmse_sum = 0.0    \n",
    "for i in range(n):\n",
    "    u_id = userIndex[str(data[i,0])]                             # user index\n",
    "    e_id = eventIndex[str(data[i,1])]                            # event index\n",
    "    rat = int(data[i,2])                                         # actual score\n",
    "    eui = rat - user_Reco[u_id,e_id]                             # prediction error\n",
    "    rmse_sum += eui**2                                            \n",
    "print('RMSE on the training set: ',np.sqrt(rmse_sum/n))  "
   ]
  },
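  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary 3:  \n",
    "User-based collaborative filtering computes the similarity between two users, user_1 and user_2, in two ways and then predicts each user's score for each event. The evaluation metric is the RMSE (root mean squared error) between the predicted and actual scores on the training set:     \n",
    "1) From the two users' scores for all events: the Pearson correlation coefficient (the third form of the formula) gives the user similarity, from which the recommendation scores are computed (userCFReco.mtx). The training-set RMSE is 0.51797.   \n",
    "2) From the users' own attributes: the Pearson correlation coefficient (the first form of the formula) gives the user similarity, from which the recommendation scores are computed (userReco.mtx). The training-set RMSE is 0.51794.          \n",
    "The two similarity measures give nearly identical RMSE values.  \n"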
   ]
  },
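  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------------------------------------\n",
    "# 4.  Event-Based Collaborative Filtering (the event's own attributes and user-event association analysis)\n",
    "Event-based collaborative filtering computes the similarity between two events, event_1 and event_2, in two ways:  \n",
    "1) from the similarity of the scores users gave the two events;    \n",
    "2) from the events' own attributes.       \n",
    "The known scores (userEventScores, each user's interest in each event) are then combined with the similarity to compute a recommendation score (interest) for every user-event pair. "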
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1) Compute event similarity from the scores users gave each pair of events, then the recommendation scores\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)\n",
    "diff_userScores = userEventScores - mean_userScores \n",
    "diff_userScores = np.array(diff_userScores)\n",
    "\n",
    "eventInterestSim = ss.dok_matrix((n_items,n_items))\n",
    "for i in range(n_items):             \n",
    "    eventInterestSim[i,i] = 1.0                                         \n",
    "    \n",
    "uniqueEventPairs = pickle.load(open('uniqueEventPairs.pkl','rb'))   # fixed file name: was 'uniuqeEventPairs.pkl'\n",
    "for e1,e2 in uniqueEventPairs:                                        \n",
    "    i = e1 \n",
    "    j = e2 \n",
    "    if (i,j) not in eventInterestSim.keys():                                \n",
    "        # cosine of the two events' score columns, centred on each user's mean\n",
    "        sim = np.sum(diff_userScores[:,i]*diff_userScores[:,j])\\\n",
    "        /np.sqrt(np.sum(np.power(diff_userScores[:,i],2)))/np.sqrt(np.sum(np.power(diff_userScores[:,j],2)))\n",
    "        eventInterestSim[i,j] = sim                        \n",
    "        eventInterestSim[j,i] = sim \n",
    "sio.mmwrite('eventInterestSim',eventInterestSim) \n",
    "   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute each user's recommendation score for each event from the event similarity \n",
    "\n",
    "event_similarity = sio.mmread('eventInterestSim').todense() \n",
    "event_similarity = np.mat(event_similarity)\n",
    "event_similarity_sum = np.sum(np.abs(event_similarity),axis=0)\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)\n",
    "diff_userScores = userEventScores - mean_userScores \n",
    "\n",
    "eventCFReco = mean_userScores + diff_userScores * event_similarity / event_similarity_sum\n",
    "sio.mmwrite('eventCFReco',eventCFReco)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE between predicted and actual scores on the training set:  0.5181074097428295\n"
     ]
    }
   ],
   "source": [
    "# Evaluate with RMSE (root mean squared error) between predicted and actual scores; lower is better\n",
    "\n",
    "event_Reco = sio.mmread('eventCFReco') \n",
    "event_Reco = np.mat(event_Reco)\n",
    "\n",
    "rmse_sum = 0.0\n",
    "for i in range(n):\n",
    "    u_id = userIndex[str(data[i,0])]       # row index of the user\n",
    "    e_id = eventIndex[str(data[i,1])]      # column index of the event\n",
    "    rat = int(data[i,2])                   # actual score\n",
    "    eui = rat - event_Reco[u_id,e_id]      # prediction error\n",
    "    rmse_sum += eui**2                     # accumulate squared error\n",
    "print('RMSE between predicted and actual scores on the training set: ',np.sqrt(rmse_sum/n))  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2) Compute similarity from the events' own feature attributes, then compute event recommendation scores\n",
    "\n",
    "# Suppress warning messages\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Center each event's non-term-frequency features on the row mean\n",
    "eventPropMatrix = np.mat(eventPropMatrix)\n",
    "mean_eventProp = np.mean(eventPropMatrix,axis=1)\n",
    "diff_eventProp = eventPropMatrix - mean_eventProp\n",
    "diff_eventProp = np.array(diff_eventProp)\n",
    "\n",
    "# Center each event's term-frequency features on the row mean\n",
    "eventContMatrix = np.mat(eventContMatrix)\n",
    "mean_eventCont = np.mean(eventContMatrix,axis=1)\n",
    "diff_eventCont = eventContMatrix - mean_eventCont\n",
    "diff_eventCont = np.array(diff_eventCont)\n",
    "\n",
    "eventPropSim = ss.dok_matrix((n_items, n_items))\n",
    "eventContSim = ss.dok_matrix((n_items, n_items))\n",
    "\n",
    "for i in range(n_items):\n",
    "    eventPropSim[i,i] = 1.0\n",
    "    eventContSim[i,i] = 1.0\n",
    "\n",
    "uniqueEventPairs = pickle.load(open('uniqueEventPairs.pkl', 'rb'))\n",
    "for e1, e2 in uniqueEventPairs:\n",
    "    i = e1\n",
    "    j = e2\n",
    "    \n",
    "    if (i,j) not in eventPropSim.keys():\n",
    "        epsim = np.sum(diff_eventProp[i]*diff_eventProp[j])\\\n",
    "        /np.sqrt(np.sum(np.power(diff_eventProp[i],2)))/np.sqrt(np.sum(np.power(diff_eventProp[j],2)))  # Pearson correlation as the similarity\n",
    "        eventPropSim[i,j] = epsim\n",
    "        eventPropSim[j,i] = epsim\n",
    "        \n",
    "    if (i,j) not in eventContSim.keys():\n",
    "        ecsim = np.sum(diff_eventCont[i]*diff_eventCont[j])\\\n",
    "        /np.sqrt(np.sum(np.power(diff_eventCont[i],2)))/np.sqrt(np.sum(np.power(diff_eventCont[j],2)))  # Pearson correlation as the similarity\n",
    "        eventContSim[i,j] = ecsim\n",
    "        eventContSim[j,i] = ecsim\n",
    "\n",
    "sio.mmwrite('eventPropSimilarity', eventPropSim)\n",
    "sio.mmwrite('eventContSimilarity', eventContSim)"
   ]
  },
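  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute users' recommendation scores for events from the non-term-frequency feature similarity\n",
    "\n",
    "eventProp_sim = sio.mmread('eventPropSimilarity').todense()\n",
    "# Replace NaN similarities (e.g. from constant feature rows) with 0, the default fill value\n",
    "eventProp_similarity = np.nan_to_num(eventProp_sim)\n",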
    "eventProp_similarity = np.mat(eventProp_similarity)\n",
    "eventProp_similarity_sum = np.sum(np.abs(eventProp_similarity),axis=0)\n",
    "\n",
    "userEventScores = np.mat(userEventScores)\n",
    "mean_userScores = np.mean(userEventScores,axis=1)\n",
    "diff_userScores = userEventScores - mean_userScores \n",
    "\n",
    "eventPropReco = mean_userScores + diff_userScores * eventProp_similarity / eventProp_similarity_sum\n",
    "sio.mmwrite('eventPropReco',eventPropReco)\n",
    " "
   ]
  },
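  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute users' recommendation scores for events from the term-frequency feature similarity\n",
    "\n",
    "eventCont_sim = sio.mmread('eventContSimilarity').todense() \n",
    "# Replace NaN similarities (e.g. from constant feature rows) with 0, the default fill value\n",
    "eventCont_similarity = np.nan_to_num(eventCont_sim)\n",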
    "eventCont_similarity = np.mat(eventCont_similarity)\n",
    "eventCont_similarity_sum = np.sum(np.abs(eventCont_similarity),axis=0)\n",
    "\n",
    "eventContReco = mean_userScores + diff_userScores * eventCont_similarity / eventCont_similarity_sum\n",
    "sio.mmwrite('eventContReco',eventContReco)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE on the training set using non-term-frequency features:  0.5180217304796338\n",
      "RMSE on the training set using term-frequency features:  0.5180217442246566\n"
     ]
    }
   ],
   "source": [
    "# Evaluate with RMSE (root mean squared error) between predicted and actual scores; lower is better\n",
    "\n",
    "eventProp_Reco = sio.mmread('eventPropReco') \n",
    "eventProp_Reco = np.mat(eventProp_Reco)\n",
    "\n",
    "eventCont_Reco = sio.mmread('eventContReco')\n",
    "eventCont_Reco = np.mat(eventCont_Reco)\n",
    "\n",
    "rmse_sum_Prop = 0.0\n",
    "rmse_sum_Cont = 0.0\n",
    "for i in range(n):\n",
    "    u_id = userIndex[str(data[i,0])]              # row index of the user\n",
    "    e_id = eventIndex[str(data[i,1])]             # column index of the event\n",
    "    rat = int(data[i,2])                          # actual score\n",
    "    eui_Prop = rat - eventProp_Reco[u_id,e_id]    # error with non-term-frequency features\n",
    "    eui_Cont = rat - eventCont_Reco[u_id,e_id]    # error with term-frequency features\n",
    "    rmse_sum_Prop += eui_Prop**2\n",
    "    rmse_sum_Cont += eui_Cont**2\n",
    "\n",
    "print('RMSE on the training set using non-term-frequency features: ',np.sqrt(rmse_sum_Prop/n)) \n",
    "print('RMSE on the training set using term-frequency features: ',np.sqrt(rmse_sum_Cont/n))"
   ]
  },
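  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary 4:  \n",
    "For event-based (item-based) collaborative filtering, two methods are used to compute the similarity between two events event_1 and event_2. The users' scores for events are then predicted, and the RMSE (root mean squared error) between predicted and actual scores on the training set is used as the evaluation metric:  \n",
    "1) From all users' scores on two different events, compute the similarity of the two events as a Pearson-style correlation of the user-mean-centered scores, then compute the users' recommendation scores from the event similarity (file eventCFReco.mtx). The RMSE on the training set is 0.5181.  \n",
    "2) From the events' own feature attributes, compute the similarity of two events with the Pearson correlation coefficient, separately for the non-term-frequency features and the term-frequency features, then compute the users' recommendation scores from the event similarity (files eventPropReco.mtx and eventContReco.mtx). The RMSE on the training set is 0.5180 for the non-term-frequency features and 0.5180 for the term-frequency features (in the output above the two values differ only from the 8th decimal place onward).  \n",
    "The two ways of computing similarity give nearly identical RMSE (0.5181 vs. 0.5180), and the non-term-frequency and term-frequency features give essentially the same RMSE as each other.\n",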
    " "
   ]
  },
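  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to sanity-check the per-pair Pearson loop used for the feature-based similarity is to compare it with `np.corrcoef`, which returns all pairwise row correlations at once. The sketch below assumes a small random `features` matrix (hypothetical, 5 events by 8 features) instead of `eventPropMatrix` or `eventContMatrix`, so it only illustrates that the two computations agree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "features = rng.random((5, 8))   # hypothetical: 5 events x 8 features\n",
    "\n",
    "# Loop-style formula for one pair (i, j): center each row, then normalize the dot product\n",
    "centered = features - features.mean(axis=1, keepdims=True)\n",
    "i, j = 1, 3\n",
    "sim_ij = (np.sum(centered[i] * centered[j])\n",
    "          / np.sqrt(np.sum(centered[i] ** 2))\n",
    "          / np.sqrt(np.sum(centered[j] ** 2)))\n",
    "\n",
    "# np.corrcoef treats each row as one variable and returns the full correlation matrix\n",
    "sim_all = np.corrcoef(features)\n",
    "print(np.isclose(sim_ij, sim_all[i, j]))"
   ]
  },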
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------------------------------------\n",
    "# 5.  Implementing model-based collaborative filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a function that predicts user u_id's score for event e_id, clipped to [0, 1]\n",
    "\n",
    "def pred_SVD(u_id,e_id,bu,bi,puk,qki,ave):   \n",
    "    ans = ave + bu[u_id] + bi[e_id] + np.dot(puk[u_id,:],qki[:,e_id])\n",
    "    if ans > 1:\n",
    "        return 1 \n",
    "    elif ans < 0:\n",
    "        return 0 \n",
    "    return ans"
   ]
  },
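  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prediction rule in `pred_SVD` is the global mean plus a user bias, an event bias, and the dot product of the user and event latent factors, clipped to [0, 1]. As a rough illustration of how such a model can be trained, the sketch below applies one stochastic gradient step on the regularized squared error for a single (user, event, score) triple. All numbers and names (`mu`, `b_u`, `b_i`, `p_u`, `q_i`, `lam`) are hypothetical, and the user factors are copied before being updated so that the event-factor update uses the pre-update values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "k = 4\n",
    "mu = 0.3                      # global mean score (hypothetical value)\n",
    "b_u, b_i = 0.0, 0.0           # biases of one user and one event\n",
    "p_u = rng.random(k) / 10      # latent factors of the user\n",
    "q_i = rng.random(k) / 10      # latent factors of the event\n",
    "r = 1.0                       # observed score\n",
    "gamma, lam = 0.1, 0.15        # learning rate and regularization strength\n",
    "\n",
    "def predict(mu, b_u, b_i, p_u, q_i):\n",
    "    # Same rule as pred_SVD, with the result clipped to [0, 1]\n",
    "    return float(np.clip(mu + b_u + b_i + p_u @ q_i, 0.0, 1.0))\n",
    "\n",
    "err_before = r - predict(mu, b_u, b_i, p_u, q_i)\n",
    "\n",
    "# One SGD step on the regularized squared error\n",
    "b_u += gamma * (err_before - lam * b_u)\n",
    "b_i += gamma * (err_before - lam * b_i)\n",
    "p_old = p_u.copy()            # keep the pre-update user factors for the q update\n",
    "p_u = p_u + gamma * (err_before * q_i - lam * p_u)\n",
    "q_i = q_i + gamma * (err_before * p_old - lam * q_i)\n",
    "\n",
    "err_after = r - predict(mu, b_u, b_i, p_u, q_i)\n",
    "print(abs(err_after) < abs(err_before))"
   ]
  },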
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "when iter =  0 , rmse of train data is:  0.6529231104037342\n",
      "when iter =  1 , rmse of train data is:  0.3947279776024093\n",
      "when iter =  2 , rmse of train data is:  0.33922540893061015\n",
      "when iter =  3 , rmse of train data is:  0.3077794366825062\n",
      "when iter =  4 , rmse of train data is:  0.28641556707061094\n",
      "when iter =  5 , rmse of train data is:  0.2710218740318645\n",
      "when iter =  6 , rmse of train data is:  0.25944647969629103\n",
      "when iter =  7 , rmse of train data is:  0.2504978507947821\n",
      "when iter =  8 , rmse of train data is:  0.24339661903671095\n",
      "when iter =  9 , rmse of train data is:  0.2376389980894195\n",
      "when iter =  10 , rmse of train data is:  0.23288402773507164\n",
      "when iter =  11 , rmse of train data is:  0.22888921472666643\n",
      "when iter =  12 , rmse of train data is:  0.22548020782873723\n",
      "when iter =  13 , rmse of train data is:  0.22253196777965667\n",
      "when iter =  14 , rmse of train data is:  0.21995226759974704\n",
      "when iter =  15 , rmse of train data is:  0.21767027132417502\n",
      "when iter =  16 , rmse of train data is:  0.21563355460581898\n",
      "when iter =  17 , rmse of train data is:  0.21380084648808928\n",
      "when iter =  18 , rmse of train data is:  0.21213999882777165\n",
      "when iter =  19 , rmse of train data is:  0.2106253383693958\n"
     ]
    }
   ],
   "source": [
    "# Train the model\n",
    "\n",
    "np.random.seed(1)\n",
    "k = 20                                        # number of latent factors\n",
    "bu = np.zeros(n_users)                        # user biases\n",
    "bi = np.zeros(n_items)                        # event biases\n",
    "puk = random((n_users,k))/10*(np.sqrt(k))     # user latent factors\n",
    "qki = random((k,n_items))/10*(np.sqrt(k))     # event latent factors\n",
    "ave = np.mean(data[:,2])                      # global mean score\n",
    "\n",
    "maxIter = 20       # number of passes over the training data\n",
    "gamma = 0.1        # learning rate\n",
    "Lambda = 0.15      # regularization parameter\n",
    "\n",
    "# Train with SGD and track RMSE (root mean squared error) per iteration; lower is better\n",
    "rmse_iter = []\n",
    "for it in range(maxIter):\n",
    "    rmse_sum = 0.0\n",
    "    for i in range(n):\n",
    "        u_id = userIndex[str(data[i,0])]\n",
    "        e_id = eventIndex[str(data[i,1])]\n",
    "        rat = int(data[i,2])\n",
    "        eui = rat - pred_SVD(u_id,e_id,bu,bi,puk,qki,ave)   # prediction error\n",
    "        rmse_sum += eui**2\n",
    "        bu[u_id] += gamma*(eui-Lambda*bu[u_id])\n",
    "        bi[e_id] += gamma*(eui-Lambda*bi[e_id]) \n",
    "        temp = puk[u_id,:].copy()   # copy, so the qki update below uses the user factors from before this step\n",
    "        puk[u_id,:] += gamma*(eui*qki[:,e_id]-Lambda*puk[u_id,:])\n",
    "        qki[:,e_id] += gamma*(eui*temp-Lambda*qki[:,e_id])\n",
    "    print('when iter = ',it,', rmse of train data is: ',np.sqrt(rmse_sum/n)) \n",
    "    rmse_iter.append(np.sqrt(rmse_sum/n))\n",
    "rmse_iter = np.array(rmse_iter)\n",
    "    \n",
    "# The printed values show the RMSE decreasing as the number of iterations grows and then levelling off"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAFACAYAAABTBmBPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3XmYXHWd7/H3t9fqvZPe0tkXEjSByBKSiKJRIQOIyEXHC4yK4B1klOEOzL0zzPU+3Hl05o6jF2fUiwsKgzow6LgyTLwEXBgQCUkwBJKQhSSQztrpTm/pvft7/6jTlUqnOqlO+nRV+nxez1NPVZ3zq+pvTqr602f5nmPujoiICEBOpgsQEZHsoVAQEZEEhYKIiCQoFEREJEGhICIiCQoFERFJUCiIiEiCQkFERBIUCiIikpCX6QJGq7q62mfPnp3pMkREzirr168/7O41pxp31oXC7NmzWbduXabLEBE5q5jZG+mM0+YjERFJUCiIiEiCQkFERBIUCiIikqBQEBGRBIWCiIgkKBRERCRBoSAiIgkKBRERSYhMKGzZ38b3frcbd890KSIiWSsyofDc9sPc+/NNtHX3Z7oUEZGsFZlQqK+MAbC/tSvDlYiIZK/ohELFUCh0Z7gSEZHsFaFQKAJgf4tCQURkJJEJhZqyQnIMDmjzkYjIiCITCvm5OdSUFbJPm49EREYUmVCA+CakAwoFEZERRSwUYjr6SETkJCIWCkXsb+1WA5uIyAgiFgoxOnsH1MAmIjKCaIWCGthERE4qWqGgBjYRkZOKVChMUQObiMhJRSoUatXAJiJyUpEKhaEGNm0+EhFJLVKhAMcOSxURkRNFMBTUwCYiMpIIhoIa2ERERhLBUFADm4jISCIXClMq1MAmIjKSyIXC1Eo1sImIjCRyoTDUwKZTaIuInChyoTDUwLa/RZuPRESGi1woqIFNRGRkkQsFUAObiMhIIhoKamATEUklkqEwpSKmBjYRkRRCDQUzu9LMtprZDjO7Z4QxHzGzzWa2ycweDbOeIVMritTAJiKSQl5Yb2xmucD9wBVAA7DWzB53981JY+YDfwW8w92PmFltWPUkG2pgO9DaTUVR/nj8SBGRs0KYawpLgR3uvtPde4HHgA8OG/PHwP3ufgTA3Q+FWE/CUAPbPu1XEBE5TpihMA3Yk/S8IZiWbAGwwMx+a2YvmNmVqd7IzG4zs3Vmtq6xsfGMC1MDm4hIamGGgqWYNnzPbh4wH1gB3Ah8x8wqT3iR+wPuvsTdl9TU1JxxYWpgExFJLcxQaABmJD2fDuxLMebn7t7n7ruArcRDIlRqYBMRSS3MUFgLzDezOWZWANwAPD5szM+A9wCYWTXxzUk7Q6wpYYoa2EREThBaKLh7P3AH8CSwBfihu28ys8+Z2bXBsCeBJjPbDPwa+O/u3hRWTcmmqoFNROQEoR2SCuDuq4BVw6bdm/TYgbuD27iaUhHjmW2NuDtmqXZ/iIhETyQ7mkENbCIiqUQ2FJIb2EREJC6yoaAGNhGRE0U2FNTAJiJyosiGQm1ZIaYGNhGR40Q2FPJzc6hVA5uIyHEiGwoQ34R0oE2hICIyJNKhMLUixj5tPhIRSYh0KOgKbCIix4t0KKiBTUTkeJEOBTWwiYgcL9KhUF+hBjYRkWTRDoVKNbCJiCSLdCgkGtgUCiIiQMRDIdHApsNSRUSAiIcCqIFNRCRZ5ENBDWwiIsdEPhTUwCYickzkQ6G+IqYGNhGRgEJB11UQEUlQKAQNbPvVwCYiolAYamBTr4KIiEJBDWwiIkkiHwpqYBMROSbyoQBqYBMRGaJQAOrL1cAmIgIKBQDqK9XAJiICCgXgWANbe48a2EQk2hQKHGtg29+i/QoiEm0KBdTAJiIyRKGAGthERIYoFFADm4jIEIUCamATERmiUAiogU1ERKGQUF8e0+YjEYk8hUKgvjLG/pYuNbCJSKQpFAL1
FTGOqoFNRCJOoRBQA5uIiEIhQQ1sIiIKhQQ1sImIKBQS1MAmIqJQSMjPzaGmtJAD2nwkIhGmUEhSX1mkNQURiTSFQhI1sIlI1CkUkqiBTUSiTqGQRA1sIhJ1oYaCmV1pZlvNbIeZ3ZNi/ifMrNHMNgS3/xJmPaeiBjYRibrQQsHMcoH7gauAhcCNZrYwxdAfuPsFwe07YdWTDjWwiUjUhbmmsBTY4e473b0XeAz4YIg/74xNCULhgHY2i0hEhRkK04A9Sc8bgmnDfcjMNprZj8xsRoj1nFJdeQwz2KdQEJGICjMULMW04Yf1/Bsw290XA08D3035Rma3mdk6M1vX2Ng4xmUeowY2EYm6MEOhAUj+y386sC95gLs3uXtP8PTbwMWp3sjdH3D3Je6+pKamJpRih6iBTUSiLMxQWAvMN7M5ZlYA3AA8njzAzOqTnl4LbAmxnrSogU1EoiwvrDd2934zuwN4EsgFHnL3TWb2OWCduz8O3Glm1wL9QDPwibDqSVd9ZYxntzfi7pil2gImIjJxhRYKAO6+Clg1bNq9SY//CvirMGsYreQGtvJYfqbLEREZV+poHmZK0MCmw1JFJIoUCsNMDXoV9rXoCCQRiR6FwjBqYBORKFMoDKMGNhGJMoXCMGpgE5EoUyikoAY2EYkqhUIKamATkahSKKQwpSKmHc0iEkkKhRSmVsbo6Omnrbsv06WIiIwrhUIKamATkahSKKSgBjYRiSqFQgpqYBORqFIopKAGNhGJKoVCCmpgE5GoUiiMoL5CvQoiEj0KhRHUV6irWUSiR6EwAjWwiUgUKRRGoAY2EYkihcII1MAmIlGkUBiBGthEJIoUCiNQA5uIRJFCYQRqYBORKFIojEANbCISRWmFgsV91MzuDZ7PNLOl4ZaWeWpgE5GoSXdN4evA24Ebg+ftwP2hVJRF1MAmIlGTbigsc/fPAN0A7n4EKAitqiyhBjYRiZp0Q6HPzHIBBzCzGmAwtKqyhBrYRCRq0g2FrwI/BWrN7G+B54D/HVpVWUINbCISNXnpDHL3R8xsPfA+wIDr3H1LqJVlgfqkBrYFdWUZrkZEJHzpHn00D9jl7vcDrwJXmFllqJVlgXo1sIlIxKS7+ejHwICZnQN8B5gDPBpaVVliqIFNRyCJSFSkGwqD7t4PXA98xd3vAurDKys7DDWw7VcDm4hExGiOProR+DjwRDAtP5ySsosa2EQkStINhVuIN6/9rbvvMrM5wD+HV1b2UAObiERJukcfbQbuTHq+C/hCWEVlkykVMZ7bcTjTZYiIjIt0jz66xsx+b2bNZtZmZu1m1hZ2cdmgvkINbCISHWmtKQD/SHwn8yvu7iHWk3XqK481sJXHIrEbRUQiLN19CnuAV6MWCHCsV0H7FUQkCtJdU/gLYJWZPQP0DE109y+HUlUWSYSCLsspIhGQbij8LdABxIjA2VGTqYFNRKIk3VCY7O4rQ60kS6mBTUSiJN19Ck+bWSRDAdTAJiLRccpQMDMjvk/h/5lZV9QOSYV4r4JCQUSi4JShEBxxtMHdc9y9yN3L3b3M3cvHob6sUF9RpDOlikgkpLv56HdmdkmolWSxoQa2djWwicgEl+6O5vcAt5vZbuAo8QvtuLsvDquwbDLUwLa/tZsyNbCJyASWbihcFWoVWS65gU1XYBORiSytzUfu/kaq26leZ2ZXmtlWM9thZvecZNyHzczNbMloih8vamATkahId5/CqJlZLnA/8bWMhcCNZrYwxbgy4mdgXRNWLWeqtkwNbCISDaGFArAU2OHuO929F3gM+GCKcZ8Hvghk7W/cgrwcqtXAJiIREGYoTCN+Ir0hDcG0BDO7EJjh7k+Q5aaqV0FEIiDMULAU0xJnWTWzHOAfgD8/5RuZ3WZm68xsXWNj4xiWmL4pFTH1KojIhBdmKDQAM5KeTwf2JT0vA84DfhMc6roceDzVzmZ3f8Ddl7j7kpqamhBLHpkuyykiURBmKKwF5pvZHDMrAG4A
Hh+a6e6t7l7t7rPdfTbwAnCtu68LsabTNn1SER09/bzRdDTTpYiIhCa0UHD3fuAO4ElgC/BDd99kZp8zs2vD+rlhuWbxVArzcvjar3ZkuhQRkdCk27x2Wtx9FbBq2LR7Rxi7IsxaztSUihgfWz6Lh367i9vfPY9zakszXZKIyJgLc/PRhHP7innE8nP5x6e3ZboUEZFQKBRGobq0kFveMZsnNu5ny/7InDlcRCJEoTBKt102j7JYHl9+SmsLIjLxKBRGqaI4n9sum8tTmw+yYU9LpssRERlTCoXTcMs75zCpOJ/7Vm/NdCkiImNKoXAaSgvz+JMV83h2+2HW7GzKdDkiImNGoXCaPrZ8NjVlhdy3ehvxK5aKiJz9FAqnqagglz997zm8uLuZZ7cfznQ5IiJjQqFwBv7zJTOYVlnEfau3am1BRCYEhcIZKMzL5c73ncPLDa08veVQpssRETljCoUz9KGLpjO7qpj7Vm9lcFBrCyJydlMonKG83BzuumIBrx1o599f2Z/pckREzohCYQxcs3gqC+pK+Yent9E/MJjpckRETptCYQzk5hh3X7GAnY1H+dmGfad+gYhIllIojJE/WDSF86aV85VfbqO3X2sLInJ2UiiMETPjz1eey57mLn64bk+myxEROS0KhTG0YkENF8+axNd+tZ3uvoFMlyMiMmoKhTEUX1tYwMG2Hh5Z82amyxERGTWFwhi7dF417zinim/8ZgdHe/ozXY6IyKgoFEJw9xXncrijl4ef353pUkRERkWhEIKLZ03ivW+p5VvPvE5rV1+myxERSZtCISR3X7GAtu5+HnxuV6ZLERFJm0IhJOdNq+Dq86fw4LM7aT7am+lyRETSolAI0V2XL6Czb4BvPfN6pksREUmLQiFE8+vKuO6CaXz3d7s51Nad6XJERE5JoRCyP7t8Pn0Dztd/o7UFEcl+CoWQzaoq4SNLpvPomjfZ29KV6XJERE5KoTAO7njvfAC+9svtGa5EROTkFArjYFplETctm8m/rm/gmW2NmS5HRGRECoVxcvfKBSyoK+NP/nk9G/a0ZLocEZGUFArjpDyWz3dvvYTq0kJu+acX2XGoI9MliYicQKEwjmrLYnz/k0vJzTE+/uAa9rdqx7OIZBeFwjibVVXCw7cspa27n48/+CItnep2FpHsoVDIgPOmVfDAxy/mjaZObn14LV29uiCPiGQHhUKGXDqvmq/eeAEb9rTw6UfW0zeg6zqLSOYpFDLoyvPq+ZvrzufXWxv5yx9tZHDQM12SiERcXqYLiLqbls2kqaOH+57aRlVpAZ99/8JMlyQiEaZQyAJ3vPccDnf08O1nd1FdWsin3j0v0yWJSEQpFLKAmfG/PrCI5s4+/u4XrzG5pIA/XDIj02WJSAQpFLJETo5x3x++jZbOXu75yStMKi7g8oV1mS5LRCJGO5qzSEFeDt/46MUsmlrOZx59ibW7mzNdkohEjEIhy5QW5vFPn7iEaZVFfPLhtbx2oC3TJYlIhCgUslBVaSHf++RSigpyufmhF9nT3JnpkkQkIhQKWWr6pGK+d+syunoHuPmhF2nq6Ml0SSISAQqFLHbulDIe+sQl7G3p4paH19LR05/pkkRkglMoZLklsyfz9T+6iE372vjU99fR3t2X6ZJEZAJTKJwF3vfWOr74ocW8sLOZD3ztOV7d25rpkkRkggo1FMzsSjPbamY7zOyeFPNvN7NXzGyDmT1nZjrHwwg+dPF0HrttOT39g1z/9ef53u92465zJYnI2AotFMwsF7gfuApYCNyY4pf+o+5+vrtfAHwR+HJY9UwEl8yezL/feRnvnF/NvT/fxKcfeYnWLm1OEpGxE+aawlJgh7vvdPde4DHgg8kD3D35IPwSQH/6nsLkkgK+8/El/I+r38JTmw9yzdee5WVd81lExkiYoTAN2JP0vCGYdhwz+4yZvU58TeHOEOuZMHJyjNveNY8f3v52Bgfhw998ngef26XNSSJyxsIMBUsx
7YTfWu5+v7vPA/4S+J8p38jsNjNbZ2brGhsbx7jMs9dFMyex6s7LWHFuLZ9/YjN//L31uryniJyRMEOhAUg+1ed0YN9Jxj8GXJdqhrs/4O5L3H1JTU3NGJZ49qsozueBj13Mvdcs5Jlth7j6K8+y/o0jmS5LRM5SYYbCWmC+mc0xswLgBuDx5AFmNj/p6fuB7SHWM2GZGbe+cw4/uv1ScnONj3zrd3zzmdd1JTcRGbXQQsHd+4E7gCeBLcAP3X2TmX3OzK4Nht1hZpvMbANwN3BzWPVEwdtmVPLvd17GHyyq4wu/eI1bv7tWp8cQkVGxs23n5JIlS3zdunWZLiOruTv/vOZNPv/EZiYV5/PVGy5k2dyqTJclIhlkZuvdfcmpxqmjeQIyMz62fBY//fSlFBfkceO3X+D//mq7NieJyCkpFCawRVMr+Lc/fSfXLJ7K/1m9jY89tIYdhzoyXZaIZDGFwgRXWpjHV264gL+7/nw2vNnCyn94hnt+vJEDrd2ZLk1EspBCIQLMjBuXzuQ//uI93HzpbH78UgPv/tKv+cIvXqO1U6fJEJFjtKM5gvY0d/Llp7bxsw17KY/l8+kV87j50tnE8nMzXZqIhCTdHc0KhQjbvK+NLz75Gr/Z2kh9RYy7Ll/A9RdNIy9XK5AiE42OPpJTWji1nIdvWcq//PFyastj/MWPN3LlV57lyU0HdB4lkYhSKAhvn1fFzz59Kd/86EUMDjqf+v56PvSN53lxV3OmSxORcaZQECC+M/rK8+pZfde7+Lvrz6fhSBcf+dbvuPXhtbx2oO3UbyAiE4L2KUhKXb0D/NPzu/jGb16no6ef6y+czl1XzGf6pOJMlyYip0E7mmVMtHT28vXfvM7Dz8cv/7ly0RT+aNlM3j63CrNUZ0cXkWykUJAxtbeliwef3cWPX2qgtauPudUl3Lh0Jh++eDqTSgoyXZ6InIJCQULR3TfAqlf28+iaN1n3xhEK8nK4+rwp3LRsFpfMnqS1B5EspVCQ0G090M6ja97gJy/tpb2nn/m1pdy0bCbXXzidiuL8TJcnIkkUCjJuOnv7eeLl/Tzy4pu8vKeFWH4O1yyeyk3LZnLhjEqtPYhkAYWCZMSre1t59MU3+fnv93K0d4C31pdz07KZXHfBVMpiWnsQyRSFgmRUR08/P9+wl0deeJPN+9soLsjlA4un8v7F9SyfW0VBnlpkRMaTQkGygrvzckMrj655gyc27qezd4Cywjze85ZarlhYx4pza7QGITIOFAqSdbr7BvjtjsOs3nSQp7ccpOloL/m5xqXzqlm5qI4r3lpHbXks02WKTEgKBclqA4POS28eYfWmA6zefJA3mjoBuHBmJSsXTmHlojrm1ZRmuEqRiUOhIGcNd2fbwQ6e2hwPiI0NrQDMqylh5aIprFxYx9umV5KTo6OYRE6XQkHOWvtaunh6y0FWbzrICzub6B90assKed9b67h0XhXL5k6mtkybmURGQ6EgE0JrZx+/3nqI1ZsP8B/bDtPR0w/A3JoSls+tYtmcySyfW0Wd9kWInJRCQSac/oFBXt3XxpqdTazZ1czaXc20ByExp7okERDL5k6mvqIow9WKZBeFgkx4A4PO5n1tvLCziTW7mnhxVzNt3fGQmDm5mOVzJ7NsThXL51UxrVIhIdGmUJDIGRh0tuxvY82uZl7YGQ+J1q4+AKZPKmLZnCreNqOC86ZVsLC+nFh+boYrFhk/CgWJvMFB57UD7azZ1cSanc2s3d1M09FeAHJzjPm1pSyeXsH50+JB8VYFhUxgCgWRYdydfa3dvNLQyqt7W9m4N37fHARFXo4xv66MxdMqOC8Ii7dMKVNQyISgUBBJg7uzt6WLV/e28sreVjYGgXGkM77ZKS/HWFBXxuLpFSyaVsG5dWUsqCulslgXFpKzi0JB5DS5Ow1HjgXF0K0lCAqA6tJCFtSVMr+2lPl1ZcyvLWVBXZmuQidZK91QyBuPYkTOJmbGjMnFzJhczFXn
1wPH1ii2H+pg+8F2th/sYNuhDn60voGjvQOJ11aXFjC/toz5wwKjqrQwU/8ckVFRKIikwcyYPqmY6ZOKec+5tYnpQ/sphoJi+6F2th3s4Ccv7U002gFUlRRwTm0pc6pLmFlVzOyqEmZVFTOrqoTSQn0NJXvo0yhyBsyMaZVFTKssYsWwsDjQ1s22g8fWLHY0dvD0loMc7ug97j2qSwuYVVXCrMnxkJhdXczMyfHgqCzO15XrZFwpFERCYGbUVxRRX1HEuxfUHDevo6efN5qO8kZTZ3A7yu6mo7yws4mf/H7vcWPLY3nxwKgqZlZVfE1lahBCUytjFBfoKyxjS58okXFWWpjHoqkVLJpaccK87r4BGo50svtwJ7ubjvJmcye7mzp5dW8rv3j1AAODxx8YMqk4n2mTiphaUcS0SUWJtZapwa26tEBrGjIqCgWRLBLLz+Wc2jLOqS07YV7/wCCH2nvY19LF3qHbkS72tXSxu+kov91x+Lid3gAFeTlJQRFjSnmM2vIYdeXxx3XlhVSVFpKr05JLQKEgcpbIy81JrAGkOq7Q3Wnr7k8Exd6W+H1DcP/MtkYa23sYtrJBbo5RU1pIXXkhtUlhMTw8Koq0fyMKFAoiE4SZUVGUT0VRPgunlqcc0z8wSNPRXg62dXOgtZuD7T0cSnq8p7mTtbubj+vJGFKQl0NNaSHVZYXUlBZQXVoY3AqoLitMPK8pLaS8KE8BcpZSKIhESF5uDnXBGsDi6SOP6+4boLG9hwNt3Rxs6+ZgWw8H27o53N5DY0cPe1u6ebkhfoqQ4fs5AApyc6hKBEdwX1ZIVUkBk4oLmFxawOTiAiaXxG/FBbkKkSyhUBCRE8TycxMNfCczOOgc6ezlcEcvhzt6ONzRQ2N7z/HPO3rYsr+dpqM99A2kPoNCQV7OscAoOXZLDpBJJflMKi6gsjifyqICigp0TqowKBRE5LTl5BhVpfGd1edy4s7xZO5Oe08/zR29NHf2cuRoL01H4/fNwe1IZ3xaw5FOmo720t7dP+L7FeblJAKiojifyqJjoVERTK8MplcWx8dUFOVTorWSk1IoiMi4MDPKY/mUx/KZTUlar+ntH6SlKwiNjl5au/po6erjSGcvrZ19tHT20dLVS0tnH282d/JyQwtHOvvo7R8c8T1zc4zyWB7lwf6X8lg+5UV5SY+DWyyYFkyvKMqnLJZHYV7OhA4VhYKIZK2CvBxqy2LUlo3uGtzdfQPHBUZLZ/y+rbuP1q4+2rr64/fdfbR19bG/tYu27vi0kwUKQH6uURaLB0RZLI+ywqHHSdOOex7cF+ZRGsujtDCPkoI8crL0MGCFgohMOLH8XKZU5DKlYnRhAvFAGQqL1q5+2rqOhUl7d39w6zvu/s3mTtq7+2nr7qOjp590Tj5dUpBLSVJQlBbmUVIYD49U00sLc1k0teKU+3nOlEJBRCRJLD+XWH7uqNdOhgwOOkd7++noORYgbUGYHO2J39q74/OP9vTTHtx3dPfTfLQzMb2jp/+EHfN/c915fHT5rLH4Z45IoSAiMoZycoY2L+VTf+KZTEalp3+Aju5+jvYM0N7TR1356QXVaCgURESyVGFeLoWluVSVjt/PzAnzzc3sSjPbamY7zOyeFPPvNrPNZrbRzH5pZuGuF4mIyEmFFgpmlgvcD1wFLARuNLOFw4b9Hlji7ouBHwFfDKseERE5tTDXFJYCO9x9p7v3Ao8BH0we4O6/dvfO4OkLwEka70VEJGxhhsI0YE/S84Zg2kg+Cfwi1Qwzu83M1pnZusbGxjEsUUREkoUZCqk6M1IevWtmHwWWAF9KNd/dH3D3Je6+pKamJtUQEREZA2EefdQAzEh6Ph3YN3yQmV0OfBZ4t7v3hFiPiIicQphrCmuB+WY2x8wKgBuAx5MHmNmFwLeAa939UIi1iIhIGkILBXfvB+4AngS2AD90901m9jkzuzYY9iWgFPhXM9tgZo+P8HYiIjIOQm1ec/dVwKph0+5N
enx5mD9fRERGJ9TmNRERObuYp3M6vyxiZo3AG6f58mrg8BiWM9ZU35lRfWcu22tUfadvlruf8vDNsy4UzoSZrXP3JZmuYySq78yovjOX7TWqvvBp85GIiCQoFEREJCFqofBApgs4BdV3ZlTfmcv2GlVfyCK1T0FERE4uamsKIiJyEgoFERFJmJChkMYV3wrN7AfB/DVmNnsca5thZr82sy1mtsnM/muKMSvMrDU49ccGM7s31XuFWONuM3sl+NnrUsw3M/tqsPw2mtlF41jbuUnLZYOZtZnZnw0bM+7Lz8weMrNDZvZq0rTJZvaUmW0P7ieN8NqbgzHbzezmcartS2b2WvD/91MzqxzhtSf9LIRc41+b2d6k/8erR3jtSb/vIdb3g6TadpvZhhFeOy7LcMy4+4S6AbnA68BcoAB4GVg4bMyngW8Gj28AfjCO9dUDFwWPy4BtKepbATyRwWW4G6g+yfyriV/7woDlwJoM/l8fIN6Uk9HlB7wLuAh4NWnaF4F7gsf3AH+f4nWTgZ3B/aTg8aRxqG0lkBc8/vtUtaXzWQi5xr8G/lsan4GTft/Dqm/Y/PuAezO5DMfqNhHXFE55xbfg+XeDxz8C3mdmqa7/MObcfb+7vxQ8bid+ssCTXXwoG30Q+J7HvQBUmll9Bup4H/C6u59uh/uYcff/AJqHTU7+nH0XuC7FS/8AeMrdm939CPAUcGXYtbn7ao+ftBKy4KqHIyy/dKTzfT9jJ6sv+N3xEeBfxvrnZsJEDIV0rviWGBN8MVqBqnGpLkmw2epCYE2K2W83s5fN7BdmtmhcC4tfDGm1ma03s9tSzB/tVfXCcgMjfxEzufyG1Ln7foj/MQDUphiTDcvyVka46iGn/iyE7Y5gE9dDI2x+y4bldxlw0N23jzA/08twVCZiKKRzxbe0rwoXFjMrBX4M/Jm7tw2b/RLxTSJvA74G/Gw8awPe4e4XAVcBnzGzdw2bnw3LrwC4FvjXFLMzvfxGI6PL0sw+C/QDj4ww5FSfhTB9A5gHXADsJ76JZriMfxaBGzn5WkIml+GoTcRQSOeKb4kxZpYHVHB6q66nxczyiQfCI+7+k+Hz3b3N3TuCx6uAfDOrHq/63H1fcH8I+CnxVfRkaV1VL2T7tIaiAAACxElEQVRXAS+5+8HhMzK9/JIcHNqsFtynupBUxpZlsFP7GuCPPNj4PVwan4XQuPtBdx9w90Hg2yP87Ix+FoPfH9cDPxhpTCaX4emYiKFwyiu+Bc+HjvL4MPCrkb4UYy3Y/vggsMXdvzzCmClD+zjMbCnx/6emcaqvxMzKhh4T3yH56rBhjwMfD45CWg60Dm0mGUcj/nWWyeU3TPLn7Gbg5ynGPAmsNLNJweaRlcG0UJnZlcBfEr/qYecIY9L5LIRZY/J+qv80ws9O5/sepsuB19y9IdXMTC/D05LpPd1h3IgfHbON+FEJnw2mfY74FwAgRnyzww7gRWDuONb2TuKrtxuBDcHtauB24PZgzB3AJuJHUrwAXDqO9c0Nfu7LQQ1Dyy+5PgPuD5bvK8CScf7/LSb+S74iaVpGlx/xgNoP9BH/6/WTxPdT/RLYHtxPDsYuAb6T9Npbg8/iDuCWcaptB/Ft8UOfwaGj8aYCq072WRjH5ff94PO1kfgv+vrhNQbPT/i+j0d9wfSHhz53SWMzsgzH6qbTXIiISMJE3HwkIiKnSaEgIiIJCgUREUlQKIiISIJCQUREEhQKIqNgZs8H97PN7KZM1yMy1hQKIqPg7pcGD2cDowoFM8sd84JExphCQWQUzKwjePgF4LLgHPl3mVlucI2CtcEJ3D4VjF9h8etnPEq8EUskq+VlugCRs9Q9xM/1fw1AcPbLVne/xMwKgd+a2epg7FLgPHfflaFaRdKmUBAZGyuBxWb24eB5BTAf6AVeVCDI2UKhIDI2DPhTdz/uZHZmtgI4mpGKRE6D9imInJ524pdTHfIk8CfBadExswXBWTFFzipaUxA5PRuB
fjN7mfiZMr9C/Iikl4LTdjeS+vKbIllNZ0kVEZEEbT4SEZEEhYKIiCQoFEREJEGhICIiCQoFERFJUCiIiEiCQkFERBL+P0/ZGPamspyMAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x179e2e10>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plot how the RMSE changes with the number of iterations\n",
    "\n",
    "plt.figure(figsize=(6,5))\n",
    "plt.plot(range(len(rmse_iter)),rmse_iter)\n",
    "plt.xlabel('iter')\n",
    "plt.ylabel('rmse')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lambda =  1e-05  rmse of train data is:  0.006836769276528758\n",
      "Lambda =  0.0001  rmse of train data is:  0.00027284848003831293\n",
      "Lambda =  0.001  rmse of train data is:  0.0026544641135271547\n",
      "Lambda =  0.01  rmse of train data is:  0.020302653627157788\n",
      "Lambda =  0.1  rmse of train data is:  0.13759993707892676\n",
      "Lambda =  1  rmse of train data is:  0.3204613291775001\n",
      "Lambda =  10  rmse of train data is:  0.4213292605704985\n"
     ]
    }
   ],
   "source": [
    "# Train the model with different regularization parameters\n",
    "\n",
    "# Initialize the model parameters\n",
    "np.random.seed(1)\n",
    "k = 20 \n",
    "bu = np.zeros(n_users)\n",
    "bi = np.zeros(n_items)\n",
    "puk = random((n_users,k))/10*(np.sqrt(k))\n",
    "qki = random((k,n_items))/10*(np.sqrt(k))\n",
    "ave = np.mean(data[:,2])\n",
    "\n",
    "# Define a training function that returns the RMSE of the last iteration\n",
    "def train_SVD(Lambda,bu,bi,puk,qki,ave):\n",
    "    gamma = 0.1   \n",
    "    maxIter = 20\n",
    "    for it in range(maxIter):  \n",
    "        rmse_sum = 0.0\n",
    "        for i in range(n):\n",
    "            u_id = userIndex[str(data[i,0])]\n",
    "            e_id = eventIndex[str(data[i,1])]\n",
    "            rat = int(data[i,2])\n",
    "            eui = rat - pred_SVD(u_id,e_id,bu,bi,puk,qki,ave)\n",
    "            rmse_sum += eui**2\n",
    "            bu[u_id] += gamma*(eui-Lambda*bu[u_id])\n",
    "            bi[e_id] += gamma*(eui-Lambda*bi[e_id]) \n",
    "            temp = puk[u_id,:].copy()   # copy, so the qki update below uses the user factors from before this step\n",
    "            puk[u_id,:] += gamma*(eui*qki[:,e_id]-Lambda*puk[u_id,:])\n",
    "            qki[:,e_id] += gamma*(eui*temp-Lambda*qki[:,e_id])\n",
    "    print('Lambda = ',Lambda, 'rmse of train data is: ',np.sqrt(rmse_sum/n)) \n",
    "    return np.sqrt(rmse_sum/n)\n",
    "    \n",
    "# Note: train_SVD updates bu, bi, puk and qki in place, so each Lambda run below\n",
    "# continues from the parameters left by the previous run, not from the initialization above\n",
    "Lam = [0.00001,0.0001,0.001,0.01,0.1,1,10]\n",
    "rmse = []\n",
    "for Lambda in Lam:\n",
    "    rmse_sum = train_SVD(Lambda,bu,bi,puk,qki,ave)\n",
    "    rmse.append(rmse_sum)"
   ]
  },
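  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sweep above passes the same `bu`, `bi`, `puk` and `qki` arrays to every call, and those arrays are updated in place, so each `Lambda` value starts from the parameters left by the previous run. If independent runs are wanted, one possible approach is to re-create the parameters inside the training function. The sketch below does this on a tiny made-up data set; `train_rmse`, `toy` and the chosen sizes are hypothetical and not part of the notebook's pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def train_rmse(ratings, n_users, n_items, lam, k=5, gamma=0.1, n_iter=20, seed=1):\n",
    "    # Fresh parameters on every call, so runs with different lam are independent\n",
    "    rng = np.random.default_rng(seed)\n",
    "    bu, bi = np.zeros(n_users), np.zeros(n_items)\n",
    "    P = rng.random((n_users, k)) / 10\n",
    "    Q = rng.random((k, n_items)) / 10\n",
    "    mu = np.mean([r for _, _, r in ratings])\n",
    "    for _ in range(n_iter):\n",
    "        sq_err = 0.0\n",
    "        for u, i, r in ratings:\n",
    "            err = r - np.clip(mu + bu[u] + bi[i] + P[u] @ Q[:, i], 0.0, 1.0)\n",
    "            sq_err += err ** 2\n",
    "            bu[u] += gamma * (err - lam * bu[u])\n",
    "            bi[i] += gamma * (err - lam * bi[i])\n",
    "            p_old = P[u].copy()   # pre-update user factors for the Q update\n",
    "            P[u] += gamma * (err * Q[:, i] - lam * P[u])\n",
    "            Q[:, i] += gamma * (err * p_old - lam * Q[:, i])\n",
    "    # RMSE accumulated over the last pass, as in train_SVD\n",
    "    return np.sqrt(sq_err / len(ratings))\n",
    "\n",
    "# Hypothetical toy data: (user index, event index, 0/1 interest)\n",
    "toy = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 2, 0), (2, 1, 1), (2, 2, 0)]\n",
    "for lam in [0.001, 0.1, 10]:\n",
    "    print(lam, train_rmse(toy, n_users=3, n_items=3, lam=lam))"
   ]
  },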
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAFACAYAAABTBmBPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3XmcVPWZ7/HP0ys0S7M1IN3sEhUFXFrEHRMz0cQl0cRA9EbjQtQYcycz42RubszceHUyyUwyuRlhXGIcMxGSmI0xaMwioFEjIIhsKt1sDQjN0g1002s994+qbotOF1Qvp08t3/frxcs6VadOPdX9sr79O+dXv8fcHREREYCcsAsQEZHUoVAQEZF2CgUREWmnUBARkXYKBRERaadQEBGRdgoFERFpp1AQEZF2CgUREWmXF3YBXTVixAifMGFC2GWIiKSVVatW7XP3khPtl3ahMGHCBFauXBl2GSIiacXMtiWzn04fiYhIO4WCiIi0UyiIiEg7hYKIiLRTKIiISDuFgoiItFMoiIhIO4WCiIi0UyiIiEg7hYKISIpzd37w8hZq6psCfy2FgohIintkeSUPPLuBZ1ZVBf5aCgURkRS25K3dfPO5TVw1/SRuvXBi4K+nUBARSVGrtx/kr3+yhrPHDeFfPjWDnBwL/DUVCiIiKWjHgXrueGolIwcX8thny+mXn9snr6tQEBFJMbVHm7n1yRU0tkT44S3nMnxgYZ+9tkJBRCSFNLdG+MKP32DLvjoeuekcTh45qE9fP+2a7IiIZCp35/5fr+Plzfv41ienc8HJI/q8Bo0URERSxKPLK1n4+g6+cNlkbigfG0oNCgURkRTw3Fu7+afY1NO/+fApodWhUBARCdnq7Qf5nz9Zw1l9OPU0kUBDwcyuMLO3zWyzmX3lOPt90szczMqDrEdEJNWENfU0kcBCwcxygYeBK4GpwFwzm9rJfoOAe4E/B1WLiEgqOtRw7NTTEX049TSRIEcKM4HN7l7p7k3AIuDaTvZ7APgW0BBgLSIiKSXsqaeJBBkKpcCOuO2q2H3tzOwsYKy7P3u8A5nZPDNbaWYrq6ure79SEZE+1Db19KV39/HQJ6aFMvU0kSBDobMrJd7+oFkO8F3gb050IHd/1N3L3b28pKSkF0sUEel7bVNP7549mRvODWfqaSJBhkIVEP9uy4BdcduDgDOApWa2FZgFLNbFZhHJZG1TTz82/ST+9q/Cm3qaSJChsAKYYmYTzawAmAMsbnvQ3WvdfYS7T3D3CcBrwDXuvjLAmkREQrNmR0371NN/DXnqaSKBhYK7twD3AL8FNgI/dff1ZvYNM7smqNcVEUlFOw7Uc/t/rkiZqaeJBLr2kbsvAZZ0uO/+BPvODrIWEZGwxE89XTRvVkpMPU1E32gWEQlQqk49TUSrpIqIBCR+6um3rg9n1dOu0khBRCQgqTz1NBGFgohIANqnnk5LzamniSgURER62TFTT29IzamniSgURER6UdXBem7/z9RZ9bSrdKFZRKSXvD/1tJVF885L6amniWikICLSC9qmnlZW1/EfaTD1NBGNFEREeqjj1NML02DqaSIaKYiI9FDb1NO70mjqaSIKBRGRHnh+3W6++Xx06unfpdHU00QUCiIi3dQ29fTMsek39TQRhYKISDe0TT0dMTA9p54mogvNIiJdlAlTTxPRSEFEpAsyZeppIhopiIgkKTr1dD0vvbuPf75+WlpPPU1EIwURkSQ99lIlC1/fzl2zJ/Ppc8eFXU4gFAoiIkl4ft37q55mwtTTRBQKIiIn8GZs6umMssyZepqIQkFE5DiqDtZzW2zq6eM3Z87U00R0oVlEJIH4qacL78isqaeJaKQgItKJjlNPp4zKrKmniWikICLSQTZMPU1EIwURkQ6yYeppIgoFEZE42TL1NBGFgohITDZNPU1EoSAiwrFTTzNp1dOu0oVmEcl6HaeelgzK
/KmniWikICJZLX7q6YIbs2fqaSIaKYhI1nJ3vr74/amnF03JnqmniWikICJZ67GXKnn6z9u589Lsm3qaiEJBRLJS29TTj04bzX0fyb6pp4koFEQk68RPPf3ODWdm5dTTRBQKIpJVNPX0+HShWUSyxqGGZm57cqWmnh6HRgoikhXapp5WVB/R1NPj0EhBRDJe/NTTb16nqafHo5GCiGS8x1/a0j71dM5MTT09HoWCiGS059e9x0PPbdTU0yQpFEQkY0Wnnq7W1NMuUCiISEZqaY3wxYWrGT5AU0+7QqEgIhnp2bW72X6gnvuvnqqpp12gUBCRjBOJOAuWVjBl5EA+fNqosMtJKwoFEck4f9y0l7f3HObOSyfrOkIXKRREJKO4O/OXbqZ0SH+uOXNM2OWkHYWCiGSUP285wBvba5h3ySTyc/UR11X6iYlIRpm/tILhAwq4oXxs2KWkJYWCiGSMdTtrWf5ONbdeNJH+BZqC2h0KBRHJGAuWVjCwMI+bZo0Pu5S0FWgomNkVZva2mW02s6908vidZvaWma0xs5fNbGqQ9YhI5qqsPsKSdbu5adZ4ivvnh11O2gosFMwsF3gYuBKYCszt5EP/aXef5u5nAt8CvhNUPSKS2R5ZVklBbg63XTQx7FLSWpAjhZnAZnevdPcmYBFwbfwO7n4obnMA4AHWIyIZanftUX6xuoobysfq28s9FGQ/hVJgR9x2FXBex53M7AvAl4EC4IMB1iMiGerxl7YQcZh3yaSwS0l7QY4UOvsa4V+MBNz9YXefDPw98L87PZDZPDNbaWYrq6ure7lMEUlnB+uaWPj6dq6ZMYaxw4rCLiftBRkKVUD8ROEyYNdx9l8EfLyzB9z9UXcvd/fykpKSXixRRNLdk69spb6plbtmTw67lIwQZCisAKaY2UQzKwDmAIvjdzCzKXGbHwPeDbAeEckwdY0tPPnKVi4/bRQfUM/lXhHYNQV3bzGze4DfArnAE+6+3sy+Aax098XAPWZ2OdAMHARuDqoeEck8C1/fTu3RZu6+TKOE3hLkhWbcfQmwpMN998fd/lKQry8imauxpZXHXqpk1qRhnD1uaNjlZAx9o1lE0tIv39jJnkON3D375LBLySgKBRFJO60R55HllZxROpiLp4wIu5yMolAQkbTz3LrdbNlXx92zT8ZMTXR6k0JBRNKKuzP/xQomjRjAR04fHXY5GUehICJpZdk71WzYfYg7L51Mrlpt9jqFgoiklflLKzipuB8fP6s07FIykkJBRNLGqm0HeH3LAW6/eBIFefr4CoJ+qiKSNua/WMHQonzmzlSrzaAoFEQkLWzcfYg/bNrLLRdMpKgg0O/dZjWFgoikhQVLKxhQkMvNF6jVZpAUCiKS8rbtr+PZtbv4zHnjGFJUEHY5GU2hICIp75HlleTl5HD7xWqiEzSFgoiktL2HGnhmZRXXn1PKqMH9wi4n4ykURCSl/eDlLbREInz+Ei2P3RcUCiKSsmrrm/mv17bx0WknMWHEgLDLyQoKBRFJWU+9upU6tdrsUwoFEUlJR5ta+eErW5l9SgmnjykOu5ysoVAQkZS0aMV2DtQ1qYlOH1MoiEjKaWqJ8NjySsrHD2XmxGFhl5NVFAoiknJ+vWYnu2obuPsyXUvoawoFEUkpkYjzH8sqOHX0IC47ZWTY5WQdhYKIpJQXNrxHRXUdd82erFabIVAoiEjKcHfmL61g3LAiPjbtpLDLyUoKBRFJGX/avJ+1VbV8/tJJ5OXq4ykM+qmLSMqYv3QzJYMKuf7ssrBLyVoKBRFJCWt21PBKxX5uv2gi/fJzwy4naykURCQlzH9xM4P75XHjLDXRCZNCQURC9+6ew7ywYQ83XzCBgYVqtRkmhYKIhG7Bsgr65+fyuQsnhl1K1lMoiEioqg7Ws3jNLubMHMuwAWq1GTaFgoiE6rHllZjBHWq1mRIUCiISmn1HGlm0YgcfP7OUMUP6h12OoFAQkRD98E9baGqNcKea6KQM
hYKIhOJwQzNPvbqNK04fzeSSgWGXIzEKBREJxX+9tp3DDS1qopNiFAoi0ucamlv5wctbuHjKCKaVqdVmKlEoiEif+9mqKvYdaeQuXUtIOUmFgkXdZGb3x7bHmdnMYEsTkUzU0hrh0eUVnDl2COdPGh52OdJBsiOF+cD5wNzY9mHg4UAqEpGM9uza3ew4cJS71UQnJSW7yMh57n62ma0GcPeDZqavHopIl0QizoKlFUwZOZDLTxsVdjnSiWRHCs1mlgs4gJmVAJHAqhKRjPTHTXt5e89h7po9mZwcjRJSUbKh8P+AXwIjzexB4GXgocCqEpGME221uZnSIf25esaYsMuRBJI6feTuPzazVcCHAAM+7u4bA61MRDLKn7cc4I3tNXzj2tPJV6vNlJXs7KPJwBZ3fxhYB3zYzIYEWpmIZJT5SysYMbCAG8rHhl2KHEeycf1zoNXMTgYeByYCTwdWlYhklHU7a1n+TjWfu1CtNlNdsqEQcfcW4Drge+7+18BJwZUlIplkwdIKBhXm8T/OV6vNVNeV2Udzgc8Cz8buyw+mJBHJJJXVR1iybjc3nT+ewf30sZHqkg2FzxH98tqD7r7FzCYC/xVcWSKSKR5ZVklBbg63qtVmWkh29tEG4N647S3AN4MqSkQyw+7ao/xidRVzzh1HyaDCsMuRJCQ7++gqM1ttZgfM7JCZHTazQ0EXJyLp7fGXthBxmHeJWm2mi2RPH/0bcDMw3N0Hu/sgdx98oieZ2RVm9raZbTazr3Ty+JfNbIOZrTWzP5iZrkKJZIiDdU0sfH0718wYw9hhRWGXI0lKNhR2AOvc3ZM9cGxZjIeBK4GpwFwzm9pht9VAubtPB54BvpXs8UUktT35ylbqm1q1PHaaSXZBvPuAJWa2DGhsu9Pdv3Oc58wENrt7JYCZLQKuBTbEPf/FuP1fA25Ksh4RSWF1jS08+cpWLj9tFB8YNSjscqQLkh0pPAjUA/2AQXH/jqeU6AijTVXsvkRuA57r7AEzm2dmK81sZXV1dZIli0hYFr6+ndqjzdx9mUYJ6SbZkcIwd/+rLh67syUQOz39ZGY3AeXApZ097u6PAo8ClJeXJ30KS0T6XmNLK4+9VMmsScM4e9zQsMuRLkp2pPB7M+tqKFQB8YuclAG7Ou5kZpcDXwWucffGjo+LSHr55Rs72XOokbtnnxx2KdINJwwFi7ZGug943syOdmFK6gpgiplNjDXkmQMs7nDss4BHiAbC3u69BRFJFa0R55HllZxROpiLp4wIuxzphhOGQmzG0Rp3z3H3/slOSY2tlXQP8FtgI/BTd19vZt8ws2tiu30bGAj8zMzWmNniBIcTkTTw3LrdbNlXx92zT1arzTSV7DWFV83sXHdf0ZWDu/sSYEmH++6Pu315V44nIqnL3Zn/YgWTRgzgI6ePDrsc6aZkrylcBrxmZhWxL5q9ZWZrgyxMRNLLsneq2bD7EHdeOplctdpMW8mOFK4MtAoRSXvzl1ZwUnE/Pn7W8WaeS6pLdkG8bUEXIiLpa9W2A7y+5QBfu2oqBXlqtZnO9NsTkR6b/2IFQ4vymTtTrTbTnUJBRHpk03uH+MOmvdxywUSKCpI9Iy2pSqEgIj2yYGkFAwpyufkCLXKcCRQKItJt2/fX899v7uLGWeMZUlQQdjnSCxQKItJtjyyvIC8nh9suUqvNTKFQEJFu2Xu4gZ+tquL6c8oYNbhf2OVIL1EoiEi3/ODlLbS0RrjzUrXazCQKBRHpstqjzfz4te18bPoYxg8fEHY50osUCiLSZT96dStHGlu461I10ck0CgUR6ZKjTa088aetXHZKCVPHHHexZElDCgUR6ZJFK7ZzoK6Juy9TE51MpFAQkaQ1tUR4bHkl504YyrkThoVdjgRAoSAiSfv1mp3sqm1Qq80MplAQkaREIs5/LKvgtJMGM/uUkrDLkYAoFEQkKS9seI+K6jrumj1ZrTYzmEJBRE7I3Zm/tILxw4v46BlqtZnJFAoi
ckJ/2ryftVW1fP6SyeTl6mMjk+m3KyInNH/pZkYOKuT6c9RqM9MpFETkuNbsqOGViv3cfvFECvNywy5HAqZQEJHjmv/iZor75/OZ89REJxsoFEQkoXf3HOaFDXu4+fzxDCxUq81soFAQkYQWLKugf34ut1yoJjrZQqEgIp2qOljP4jW7mDNzLMMGqNVmtlAoiEinHlteiRnccbGa6GQThYKI/IV9RxpZtGIHHz+zlDFD+oddjvQhhYKI/IUf/mkLTa0R7pytJjrZRqEgIsc43NDMU69u44rTRzO5ZGDY5UgfUyiIyDEef2kLhxtatDx2llIoiEi7VdsO8O8vbubqGWOYVlYcdjkSAoWCiABQU9/EvQvXUDqkPw9+4oywy5GQ6CuKIoK7c98za9l7uIFn7ryAwf3ywy5JQqKRgojwo9e28cKGPfz9FacyY+yQsMuRECkURLLc+l21/N9nN3LZKSXcquUssp5CQSSL1TW28MWnVzN0QD7/esOZ5OSozWa20zUFkSz2tV+vY+v+Op6+Y5bWNxJAIwWRrPXzVVX84o2dfPGDU5g1aXjY5UiKUCiIZKGK6iN87dfrOG/iMO790JSwy5EUolAQyTINza3c8/RqCvNy+N6cs8jVdQSJo2sKIlnmoSUb2bj7EE/cUs7o4n5hlyMpRiMFkSzy/Lr3eOrVbdx+0UQ+eOqosMuRFKRQEMkSVQfrue+ZN5leVsx9V5wadjmSohQKIlmguTXCvQtXE3H4/tyzKMjT//rSOV1TEMkC3/3dO7yxvYbvzz2L8cMHhF2OpDD9uSCS4V56t5oFyyqYO3MsV88YE3Y5kuIUCiIZbO/hBv76J2uYMnIg9191etjlSBrQ6SORDBWJOF/+yZscaWzh6Ttm0b8gN+ySJA1opCCSoRYsq+Dlzfv4+tWn84FRg8IuR9JEoKFgZleY2dtmttnMvtLJ45eY2Rtm1mJmnwyyFpFssnLrAb7zu3e4avpJzDl3bNjlSBoJLBTMLBd4GLgSmArMNbOpHXbbDtwCPB1UHSLZpqa+iS8tirbVfOi6aZhpGQtJXpDXFGYCm929EsDMFgHXAhvadnD3rbHHIgHWIZI11FZTeirI00elwI647arYfV1mZvPMbKWZrayuru6V4kQy0VOvqq2m9EyQodDZmNW7cyB3f9Tdy929vKSkpIdliWSm9btqefA3G/ngqSO57SK11ZTuCTIUqoD4K1xlwK4AX08ka8W31fyXT83QdQTptiBDYQUwxcwmmlkBMAdYHODriWSttraa35tzltpqSo8EFgru3gLcA/wW2Aj81N3Xm9k3zOwaADM718yqgE8Bj5jZ+qDqEclUbW017/2Q2mpKzwX6jWZ3XwIs6XDf/XG3VxA9rSQi3RDfVvOLH1RbTek5faNZJE21tdXsl5+rtprSa7T2kUiaUltNCYJGCiJp6Pl1u9VWUwKhUBBJM9G2mmuZobaaEgCFgkgaaWur6Q7fn3u22mpKr9M1BZE08p24tprjhheFXY5kIP2ZIZImlr9TzYKlaqspwVIoiKSBvYcb+PJP1/CBUWqrKcHS6SORFKe2mtKXFAoiKa6treY3r5umtpoSOJ0+EklhbW01r54xhk+rrab0AYWCSIqqqW/i3oWro201P3GGlsOWPqHTRyIpqK2tZvWRRn5+1wUMUltN6SMaKYikoPi2mtPL1FZT+o5CQSTFqK2mhEmhIJJC1FZTwqZrCiIp5Gu/irbVfPqOWWqrKaHQSEEkRfx8VRW/WK22mhIuhYJIClBbTUkVCgWRkDU0t/KFH7+htpqSEnRNQSRkDy3ZyKb3DvPDW85VW00JnUYKIiFqa6t5x8UTuezUkWGXI6JQEAnLjgPvt9X8u4+oraakBoWCSAiaWyN8aZHaakrq0TUFkRCoraakKv15ItLH1FZTUplCQaQPqa2mpDqdPhLpI2qrKelAoSDSR9RWU9KBTh+J9AG11ZR0oVAQCZjaako60ekjkQCpraakG40U
RAKktpqSbhQKIgFZt1NtNSX9KBREAnCksYUvLlzNsAEFaqspaUXXFEQCcP+v1rFNbTUlDWmkINLLnlFbTUljCgWRXlRRfYSv/WodsyapraakJ4WCSC9pa6vZv0BtNSV96ZqCSC958Dfvt9UcNVhtNSU9KRREuqG2vpm3dtaydmcNb1XVsraqlp01R9VWU9KeQkHkBI40trB+Z/SDf+3OWt6qqmHr/vr2x8cPL+Ls8UOZd8kkPnPeuBArFek5hYJInIbmVtbvOsRbVTWsjQVBRfUR3KOPlw7pz7TSYm44dyzTS4cwrbSY4iItXSGZQ6EgWaupJcKm9w6xtqo2egpoZy3v7DlMaySaACWDCplRVszV08cwfWwx00qLGTGwMOSqRYKlUJCs0NIa4d29R1hbVRMNgZ21bNp9mKbWCABDi/KZXjaEy08bybTSYmaMHaKLxZKVFAqScVojzpZ9R6LXAKpqWVtVw4bdh2hojgbAoH55TC8r5taLJjK9LDoCKBvaX0tRiKBQkDTn7mzbX99+AfjNqlrW76ylrqkVgKKCXM4YU8yN541nelkx08uGMH5YETn6DoFIp7ImFPYfaeRQQwtDi/IZ3C9fHwppyN3ZVdvA2h01sRCIjgIONbQAUJiXw9Qxg/nkOWVMKxvCjLJiJpUM1JfIRLoga0LhmVVV/NNzmwDIMRhSVMCQonyGFRUwpKiAYQPyGVpUwNABBQwt+svbxf3zycvVF8D70t5DDe2nf9pCYH9dEwD5ucapowdz1YwxTC8tZlpZMR8YNYh8/Y5EeiTQUDCzK4DvAbnA4+7+zQ6PFwJPAecA+4FPu/vWIGq5fOooRg4u5EBdMzX1TRyoa6KmvpkDdU1UHaxn3c5mDtQ30dQSSXiM4v750ZAYUBANjaKCDttxtwfkM6R/AQV5+pBKxoG6JtZWRb8I9mZVLW/trGHPoUYgGuIfGDWID502kmllQ5heWswpowfRLz835KpFMk9goWBmucDDwIeBKmCFmS129w1xu90GHHT3k81sDvDPwKeDqGdyyUAmlww87j7uztHm1mMC42D9+7dr6ps4UB8NlT2HGnj7vcMcrG+iPnb+ujMDC/MY2jYK+YvgiBuVxIJkaFFBynzYtUacppYITS0RGltb2283tUbev90SobHD9jGPt0ZobInQ2NLa6T4Nza28s+cIO2uOAmAGk0YM4ILJI2LXAIqZelIx/QtS42cikumCHCnMBDa7eyWAmS0CrgXiQ+Fa4B9jt58B/t3MzL3tq0J9y8woKsijqCCPsqHJP6+hubVDcDRxsL6Zmrro7fjHKvcdoaaumcONLQmP1z8/l2EDYqe3BkRPb70fHtFQKczLobGTD+GO9x273dr5/q2df6C3RHrv11CQl0Nhbg4FeXH/cnMozM/hrHFDuPmC8UwrHcIZpYPVx1gkREGGQimwI267Cjgv0T7u3mJmtcBwYF+AdfW6fvm5jC7OZXRx8vPam1oi1BztMAqpa46NTJreP81V30TVwaMcqGui9mhzUsfOMdo/dAvycimM+xCO/0AuKspr3277wC485kM795gP8cIOz/+LD/i2/fJyj7k/P9c03VMkTQQZCp19CnT80zOZfTCzecA8gHHjMmNtmYK8HEYO6sfIQckHSWvEqT0aDZGmlsixH+Kxv7oLcnN0QVxEui3IUKgCxsZtlwG7EuxTZWZ5QDFwoOOB3P1R4FGA8vLyUE4tpYLcHGPYgAK1dxSRwAT5J+UKYIqZTTSzAmAOsLjDPouBm2O3Pwn8MazrCSIiEuBIIXaN4B7gt0SnpD7h7uvN7BvASndfDPwA+JGZbSY6QpgTVD0iInJigX5Pwd2XAEs63Hd/3O0G4FNB1iAiIsnTFUkREWmnUBARkXYKBRERaadQEBGRdgoFERFpp1AQEZF2CgUREWln6fYFYjOrBrZ18+kjSLPF9o5D7yX1ZMr7AL2XVNWT9zLe3UtOtFPahUJPmNlKdy8Pu47eoPeSejLl
fYDeS6rqi/ei00ciItJOoSAiIu2yLRQeDbuAXqT3knoy5X2A3kuqCvy9ZNU1BREROb5sGymIiMhxKBRERKRd1oWCmf2jme00szWxfx8Nu6aeMrO/NTM3sxFh19JdZvaAma2N/U5eMLMxYdfUHWb2bTPbFHsvvzSzIWHX1F1m9ikzW29mETNLuymdZnaFmb1tZpvN7Cth19MTZvaEme01s3VBv1bWhULMd939zNi/JSfePXWZ2Vjgw8D2sGvpoW+7+3R3PxN4Frj/RE9IUb8DznD36cA7wD+EXE9PrAOuA5aHXUhXmVku8DBwJTAVmGtmU8OtqkeeBK7oixfK1lDIJN8F7gPSesaAux+K2xxAmr4fd3/B3Vtim68BZWHW0xPuvtHd3w67jm6aCWx290p3bwIWAdeGXFO3uftyoi2LA5etoXBPbHj/hJkNDbuY7jKza4Cd7v5m2LX0BjN70Mx2ADeSviOFeLcCz4VdRJYqBXbEbVfF7pMTCLRHc1jM7PfA6E4e+iqwAHiA6F+iDwD/SvR/3pR0gvfyv4C/6tuKuu9478Xdf+3uXwW+amb/ANwDfL1PC0zSid5HbJ+vAi3Aj/uytq5K5r2kKevkvrQcffa1jAwFd788mf3M7DGi569TVqL3YmbTgInAm2YG0dMUb5jZTHd/rw9LTFqyvxfgaeA3pGgonOh9mNnNwFXAhzzFvwjUhd9JuqkCxsZtlwG7QqolrWTd6SMzOylu8xNEL6alHXd/y91HuvsEd59A9H+Cs1M1EE7EzKbEbV4DbAqrlp4wsyuAvweucff6sOvJYiuAKWY20cwKgDnA4pBrSgtZ941mM/sRcCbRoeRW4PPuvjvUonqBmW0Fyt09LZcINrOfA6cAEaJLo9/p7jvDrarrzGwzUAjsj931mrvfGWJJ3WZmnwC+D5QANcAad/9IuFUlLzbd/N+AXOAJd38w5JK6zcwWArOJLp29B/i6u/8gkNfKtlAQEZHEsu70kYiIJKZQEBGRdgoFERFpp1AQEZF2CgUREWmnUJCMY2ZHevj8Z8xsUuz21t5efdbMliaz6mgyr21mv0/npVok9SgUROKY2elArrtXhl1Lkn4E3B12EZI5FAqSsSzq22a2zszeMrNPx+7PMbPN0ciMAAACo0lEQVT5sV4Bz5rZEjP7ZOxpNwLHXfPHzGaa2Stmtjr231Ni999iZr8ys/82sy1mdo+ZfTm232tmNizuMDfFnrvOzGbGnj881ktitZk9Qtz6PbHjrorVPC/uOIuBuT3/aYlEKRQkk11H9NvrM4DLgW/Hljm5DpgATANuB86Pe86FwKoTHHcTcIm7n0V0NdeH4h47A/gM0aWbHwTqY/u9Cnw2br8B7n4B0b/yn4jd93Xg5dj+i4Fxcfvf6u7nAOXAvWY2HMDdDwKFbdsiPZWRC+KJxFwELHT3VmCPmS0Dzo3d/zN3jwDvmdmLcc85Cag+wXGLgf+MrdfkQH7cYy+6+2HgsJnVAv8du/8tYHrcfgshuk6+mQ2OdWi7hGhg4e6/MbODcfvfG1t2AqILvU3h/aU09gJj4rZFuk0jBclknS2ffLz7AY4C/U5w3AeIfvifAVzdYf/GuNuRuO0Ix/4R1nF9GU9wP2Y2m+hI53x3nwGs7vCa/WJ1i/SYQkEy2XLg02aWa2YlRP8Sfx14Gbg+dm1hFNGFxtpsBE4+wXGLgbbF+m7pZm1t1zcuAmrdvTZW742x+68E2mYVFQMH3b3ezE4FZrUdxKLrpo8murijSI/p9JFksl8SvV7wJtG/wO9z9/diK7J+iOiy6e8AfwZqY8/5DdGQ+H3ccdaaWSR2+6fAt4iePvoy8Mdu1nbQzF4BBvN+k6f/Ayw0szeAZbzfd/t54E4zWwu8TbTNZ5tziK7E2oJIL9AqqZKVzGygux+JXaB9HbgwFhj9gRdj263hVnliZvY9YLG7/yHsWiQzaKQg2erZ2MXdAuCBtuZE7n7UzL5OtJ/v9uMdIEWsUyBI
b9JIQURE2ulCs4iItFMoiIhIO4WCiIi0UyiIiEg7hYKIiLT7/6ExB4wxVmraAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x18a77940>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 画正则参数和误差平方和的关系\n",
    "\n",
    "plt.figure(figsize=(6,5))\n",
    "plt.plot(np.log10(Lam),rmse) \n",
    "plt.xlabel('log(Lambda)')\n",
    "plt.ylabel('rmse')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 由上面结果得到，当Lambda = 0.0001时，误差平方和最小。因此预测Lambda = 0.0001时，用户对活动的打分,保存至文件svdCFReco.mtx\n",
    "\n",
    "svdCFReco = ss.dok_matrix((n_users,n_items))    # svd 用户-活动分值矩阵\n",
    "\n",
    "np.random.seed(1)\n",
    "k = 20 \n",
    "bu = np.zeros(n_users)                          \n",
    "bi = np.zeros(n_items)                           \n",
    "puk = random((n_users,k))/10*(np.sqrt(k))       \n",
    "qki = random((k,n_items))/10*(np.sqrt(k))    \n",
    "ave = np.mean(data[:,2])                      \n",
    "maxIter = 20        \n",
    "gamma = 0.1        \n",
    "Lambda = 0.0001       \n",
    "\n",
    "for it in range(maxIter):                                                   \n",
    "    for i in range(n):\n",
    "        u_id = userIndex[str(data[i,0])]                            \n",
    "        e_id = eventIndex[str(data[i,1])]    \n",
    "        rat = int(data[i,2])     \n",
    "        svdCFReco[u_id,e_id] = pred_SVD(u_id,e_id,bu,bi,puk,qki,ave)                                    \n",
    "        eui = rat - pred_SVD(u_id,e_id,bu,bi,puk,qki,ave) \n",
    "        bu[u_id] += gamma*(eui-Lambda*bu[u_id])                \n",
    "        bi[e_id] += gamma*(eui-Lambda*bi[e_id]) \n",
    "        temp = puk[u_id,:]\n",
    "        puk[u_id,:] += gamma*(eui*qki[:,e_id]-Lambda*puk[u_id,:])\n",
    "        qki[:,e_id] += gamma*(eui*temp-Lambda*qki[:,e_id])\n",
    "\n",
    "sio.mmwrite('svdCFReco',svdCFReco)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "小结5：    \n",
    "基于模型的协同过滤，通过对评分矩阵进行降维(puk * qki)和梯度下降法，在不同参数下(主要是改变正则参数Lambda)，预测用户对活动的评分，采用RMSE(总的误差平方和)作为评测标准，比较了训练集的预测分值与实际分值的误差平方和：    \n",
    "1) 随着迭代次数增大，误差平方和下降，并趋于稳定。对于Lambda = 0.015，当迭代次数为20次时，误差平方和下降为：0.2106。     \n",
    "2) 改变正则参数Lambda,在Lambda比较小的时候，误差平方和也比较小；随着Lambda增大，误差平方和迅速增大。当Lambda = 0.0001时，误差平方和最小，为：0.00027。    \n",
    "基于模型的协同过滤比基于用户和活动的协同过滤能更好的预测用户对活动的评分，并且正则参数Lambda对预测分值的影响非常大，在Lambda<1时，预测效果比较好。    "
   ]
  },
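  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The update rule summarized above can be sketched on toy data. This is a minimal illustration, not the notebook's training code: the prediction form (global mean + user bias + item bias + dot product of latent factors) is an assumption about what pred_SVD computes, and the names predict, sgd_epoch, rmse and the toy ratings are made up for the example. The user factor row is copied before it is updated, so the item factors are updated from the pre-update user factors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def predict(u, i, bu, bi, P, Q, mu):\n",
    "    # assumed biased matrix-factorization form: mean + biases + latent dot product\n",
    "    return mu + bu[u] + bi[i] + P[u, :] @ Q[:, i]\n",
    "\n",
    "def sgd_epoch(ratings, bu, bi, P, Q, mu, gamma, lam):\n",
    "    # one pass of stochastic gradient descent over all (user, item, rating) triples\n",
    "    for u, i, r in ratings:\n",
    "        err = r - predict(u, i, bu, bi, P, Q, mu)\n",
    "        bu[u] += gamma * (err - lam * bu[u])\n",
    "        bi[i] += gamma * (err - lam * bi[i])\n",
    "        p_old = P[u, :].copy()   # keep the pre-update user factors\n",
    "        P[u, :] += gamma * (err * Q[:, i] - lam * P[u, :])\n",
    "        Q[:, i] += gamma * (err * p_old - lam * Q[:, i])\n",
    "\n",
    "def rmse(ratings, bu, bi, P, Q, mu):\n",
    "    errs = [r - predict(u, i, bu, bi, P, Q, mu) for u, i, r in ratings]\n",
    "    return float(np.sqrt(np.mean(np.square(errs))))\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "toy = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (2, 1, 0), (2, 2, 1)]   # made-up ratings\n",
    "k = 2\n",
    "bu, bi = np.zeros(3), np.zeros(3)\n",
    "P = rng.random((3, k)) / 10\n",
    "Q = rng.random((k, 3)) / 10\n",
    "mu = np.mean([r for _, _, r in toy])\n",
    "\n",
    "rmse_before = rmse(toy, bu, bi, P, Q, mu)\n",
    "for _ in range(50):\n",
    "    sgd_epoch(toy, bu, bi, P, Q, mu, gamma=0.1, lam=0.0001)\n",
    "rmse_after = rmse(toy, bu, bi, P, Q, mu)\n",
    "print(rmse_before, rmse_after)"
   ]
  },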
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------------------------------------------------\n",
    "# 6.  组合各种前面生成的特征，生成新的训练数据，并对数据进行保存 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取文件 \n",
    "\n",
    "userCFReco = sio.mmread('userCFReco')\n",
    "eventCFReco = sio.mmread('eventCFReco')\n",
    "svdCFReco = sio.mmread('svdCFReco').todense() \n",
    "\n",
    "userReco = sio.mmread('userReco')      \n",
    "eventProp_Reco = sio.mmread('eventPropReco')\n",
    "eventCont_Reco = sio.mmread('eventContReco')\n",
    "\n",
    "userPop = sio.mmread('numFriends')\n",
    "friendInfluence = sio.mmread('userFriends').todense() \n",
    "eventPop = sio.mmread('eventPopularity').todense()\n",
    " "
   ]
  },
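  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on element lookup after loading: for a file stored in sparse (coordinate) format, scipy.io.mmread returns a COO-format sparse object, which does not support [row, col] indexing, while CSR does. Whether each file used in this notebook was written in sparse or dense form is not shown in this section, so the sketch below uses a made-up 3x4 matrix and a temporary file named demo.mtx purely for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "\n",
    "import scipy.io as sio\n",
    "import scipy.sparse as ss\n",
    "\n",
    "# made-up score matrix with a single non-zero entry\n",
    "scores = ss.dok_matrix((3, 4))\n",
    "scores[1, 2] = 5.0\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    path = os.path.join(tmp, 'demo.mtx')\n",
    "    sio.mmwrite(path, scores)     # written in coordinate (sparse) format\n",
    "    loaded = sio.mmread(path)     # fully read into memory before the directory is removed\n",
    "\n",
    "lookup = ss.csr_matrix(loaded)    # CSR supports lookup[row, col]\n",
    "value = lookup[1, 2]\n",
    "print(value)"
   ]
  },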
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个类，读取文件\n",
    "\n",
    "class RecommonderSystem:\n",
    "  \n",
    "    def __init__(self):\n",
    "\n",
    "        self.userCFReco = sio.mmread('userCFReco')\n",
    "        self.eventCFReco = sio.mmread('eventCFReco')\n",
    "        self.svdCFReco = sio.mmread('svdCFReco').todense() \n",
    "\n",
    "        self.userReco = sio.mmread('userReco')      \n",
    "        self.eventProp_Reco = sio.mmread('eventPropReco')\n",
    "        self.eventCont_Reco = sio.mmread('eventContReco')\n",
    "\n",
    "        self.userPop = sio.mmread('numFriends')\n",
    "        self.friendInfluence = sio.mmread('userFriends').todense() \n",
    "        self.eventPop = sio.mmread('eventPopularity').todense() \n",
    "\n",
    "        self.userIndex = pickle.load(open(\"userIndex.pkl\", 'rb'))\n",
    "        self.eventIndex = pickle.load(open(\"eventIndex.pkl\", 'rb'))\n",
    "        self.n_users = len(self.userIndex)\n",
    "        self.n_items = len(self.eventIndex)\n",
    "                      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个函数，把前面基于用户、活动、模型的协同过滤，以及生成的各种特征组合在一起，生成新的训练数据，并对数据进行保存\n",
    "\n",
    "def generateRSData(RS,train=True, header=True):\n",
    "    \n",
    "    fn = \"train.csv\" if train else \"test.csv\"\n",
    "    fin = open(fn)\n",
    "    fout = open(\"RS_\" + fn,'w')\n",
    "\n",
    "    fin.readline() \n",
    "    if header:\n",
    "        ocolnames = [\"invited\", \"userCF_reco\", \"evtCF_reco\",\"svdCF_reco\",\"user_reco\", \"evt_p_reco\",\\\n",
    "                     \"evt_c_reco\", \"user_pop\", \"frnd_infl\", \"evt_pop\"]\n",
    "        if train:\n",
    "            ocolnames.append(\"interested\")\n",
    "            ocolnames.append(\"not_interested\")\n",
    "        fout.write(\",\".join(ocolnames) + \"\\n\")\n",
    "                \n",
    "    ln = 0\n",
    "    for line in fin:\n",
    "        ln += 1 \n",
    "        if ln%500 == 0:\n",
    "            print('%s:%d (userId,eventId)=(%s,%s)' % (fn,ln,userId,eventId))\n",
    "            \n",
    "        cols = line.strip().split(',')\n",
    "        userId = userIndex[cols[0]]\n",
    "        eventId = eventIndex[cols[1]]\n",
    "        invited = cols[2]\n",
    "\n",
    "        userCF_reco = userCFReco[userId,eventId]\n",
    "        itemCF_reco = eventCFReco[userId,eventId]\n",
    "        svdCF_reco = svdCFReco[userId,eventId]\n",
    "\n",
    "        user_reco = userReco[userId,eventId]\n",
    "        evt_p_reco = eventProp_Reco[userId,eventId]\n",
    "        evt_c_reco = eventCont_Reco[userId,eventId]\n",
    "      \n",
    "        user_pop = userPop[0,userId]\n",
    "        frnd_infl = friendInfluence[userId].sum()\n",
    "        evt_pop = eventPop[eventId,0]\n",
    "        \n",
    "        ocols = [invited, userCF_reco, itemCF_reco, svdCF_reco,user_reco, evt_p_reco,evt_c_reco, user_pop, frnd_infl, evt_pop]\n",
    "        if train: \n",
    "            ocols.append(cols[4]) # interested\n",
    "            ocols.append(cols[5]) # not_interested \n",
    "        fout.write(\",\".join(map(lambda x: str(x), ocols)) + \"\\n\")\n",
    "    \n",
    "    fin.close()\n",
    "    fout.close()    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "RS = RecommonderSystem()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "生成训练数据...\n",
      "train.csv:500 (userId,eventId)=(1211,8008)\n",
      "train.csv:1000 (userId,eventId)=(2460,5930)\n",
      "train.csv:1500 (userId,eventId)=(2257,3388)\n",
      "train.csv:2000 (userId,eventId)=(436,12870)\n",
      "train.csv:2500 (userId,eventId)=(1552,8147)\n",
      "train.csv:3000 (userId,eventId)=(179,12719)\n",
      "train.csv:3500 (userId,eventId)=(2290,13234)\n",
      "train.csv:4000 (userId,eventId)=(863,12923)\n",
      "train.csv:4500 (userId,eventId)=(3099,11919)\n",
      "train.csv:5000 (userId,eventId)=(1180,6926)\n",
      "train.csv:5500 (userId,eventId)=(404,9535)\n",
      "train.csv:6000 (userId,eventId)=(2260,2364)\n",
      "train.csv:6500 (userId,eventId)=(1960,9535)\n",
      "train.csv:7000 (userId,eventId)=(2900,10889)\n",
      "train.csv:7500 (userId,eventId)=(2021,3444)\n",
      "train.csv:8000 (userId,eventId)=(2545,4270)\n",
      "train.csv:8500 (userId,eventId)=(632,535)\n",
      "train.csv:9000 (userId,eventId)=(294,4698)\n",
      "train.csv:9500 (userId,eventId)=(1742,12762)\n",
      "train.csv:10000 (userId,eventId)=(47,88)\n",
      "train.csv:10500 (userId,eventId)=(461,7481)\n",
      "train.csv:11000 (userId,eventId)=(1762,1882)\n",
      "train.csv:11500 (userId,eventId)=(934,12733)\n",
      "train.csv:12000 (userId,eventId)=(943,1412)\n",
      "train.csv:12500 (userId,eventId)=(2632,6686)\n",
      "train.csv:13000 (userId,eventId)=(2215,3196)\n",
      "train.csv:13500 (userId,eventId)=(600,9913)\n",
      "train.csv:14000 (userId,eventId)=(1518,2991)\n",
      "train.csv:14500 (userId,eventId)=(2757,8300)\n",
      "train.csv:15000 (userId,eventId)=(41,11345)\n",
      "\n",
      "生成预测数据...\n",
      "test.csv:500 (userId,eventId)=(3173,3040)\n",
      "test.csv:1000 (userId,eventId)=(2677,10248)\n",
      "test.csv:1500 (userId,eventId)=(924,7807)\n",
      "test.csv:2000 (userId,eventId)=(1418,7370)\n",
      "test.csv:2500 (userId,eventId)=(2136,2626)\n",
      "test.csv:3000 (userId,eventId)=(775,3040)\n",
      "test.csv:3500 (userId,eventId)=(1879,7750)\n",
      "test.csv:4000 (userId,eventId)=(3287,9301)\n",
      "test.csv:4500 (userId,eventId)=(2830,8776)\n",
      "test.csv:5000 (userId,eventId)=(2191,3040)\n",
      "test.csv:5500 (userId,eventId)=(2975,11353)\n",
      "test.csv:6000 (userId,eventId)=(2820,3690)\n",
      "test.csv:6500 (userId,eventId)=(1609,10304)\n",
      "test.csv:7000 (userId,eventId)=(1197,8777)\n",
      "test.csv:7500 (userId,eventId)=(1816,5847)\n",
      "test.csv:8000 (userId,eventId)=(3227,12042)\n",
      "test.csv:8500 (userId,eventId)=(2337,1320)\n",
      "test.csv:9000 (userId,eventId)=(718,10120)\n",
      "test.csv:9500 (userId,eventId)=(747,5104)\n",
      "test.csv:10000 (userId,eventId)=(2911,3915)\n"
     ]
    }
   ],
   "source": [
    "print(\"生成训练数据...\")\n",
    "generateRSData(RS,train=True,  header=True)\n",
    "\n",
    "print(\"\\n生成预测数据...\")\n",
    "generateRSData(RS, train=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "小结6：     \n",
    "把前面基于用户、活动、模型的协同过滤，以及生成的各种特征组合在一起，生成新的训练数据，并保存新数据到文件RS_train.csv和RS_test.csv。"
   ]
  },
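  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generated files can then be loaded for model training. A minimal sketch, assuming the column layout written by generateRSData; the two data rows below are made-up values for illustration only, not results from this notebook, and the variable names sample, df, X and y are chosen for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# in-memory stand-in for RS_train.csv (made-up rows)\n",
    "sample = io.StringIO('''invited,userCF_reco,evtCF_reco,svdCF_reco,user_reco,evt_p_reco,evt_c_reco,user_pop,frnd_infl,evt_pop,interested,not_interested\n",
    "0,0.1,0.2,0.9,0.0,0.3,0.4,0.01,0.5,0.02,1,0\n",
    "1,0.0,0.1,0.1,0.0,0.2,0.1,0.03,0.1,0.00,0,0\n",
    "''')\n",
    "\n",
    "df = pd.read_csv(sample)\n",
    "X = df.drop(columns=['interested', 'not_interested'])   # the ten feature columns\n",
    "y = df['interested']                                    # target label\n",
    "print(X.shape, y.tolist())"
   ]
  },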
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----------------------------------------------------\n",
    "# 总结：  \n",
    "1) 分析数据集中的用户('user')和活动('event')属性，主要分析结果保存为文件，包括：userIndex和eventIndex，eventsForUser和usersForEvent，uniqueUserPairs和uniuqeEventPairs，userMatrix，eventPropMatrix，eventContMatrix，eventPropSim，eventContSim，numFriends，userFriendseventPopularity。 \n",
    "\n",
    "2) 基于用户的协同过滤，用两种方法计算两个用户user_1和user_2之间的相似度，再预测用户对活动的评分，采用RMSE(总的误差平方和)作为评测标准，比较了训练集的预测分值与实际分值的误差平方和。     \n",
    "2.1) 根据两个用户对所有活动的打分，用Pearson相关系数(公式的第三种形式)计算两个用户的相似度，再由用户相似度计算用户对活动的推荐评分(userCFReco.mtx文件)，得到在训练集上预测分值与实际分值的误差平方和为：0.51797。      \n",
    "2.2) 根据用户本身的特征属性，用Pearson相关系数(公式的第一种形式)计算两个用户的相似度，得到在训练集上预测分值与实际分值的误差平方和为：0.51794。      \n",
    "由两种方法计算相似度得到的预测分值与实际分值的误差平方和相同。\n",
    "\n",
    "3) 基于活动的协同过滤，用两种方法计算两个活动event_1和event_2之间的相似度，再预测用户对活动的评分，比较了训练集的预测分值与实际分值的误差平方和。     \n",
    "3.1) 根据所有用户对两个不同活动的打分，用Pearson相关系数计算两个活动的相似度，得到在训练集上预测分值与实际分值的误差平方和为：0.5181。   \n",
    "3.2) 根据活动本身的特征属性，分别对非词频特征和词频特征用Pearson相关系数计算两个活动的相似度，得到非词频特征和词频特征在训练集上预测分值与实际分值的误差平方和都是0.5180。       \n",
    "由两种方法计算相似度得到的预测分值与实际分值的误差平方和相同，并且根据不同的特征属性，非词频特征和词频特征得到的误差平方和也是相同的。   \n",
    "\n",
    "4) 基于模型的协同过滤，通过对评分矩阵进行降维(puk * qki)和梯度下降法，在不同参数下(主要是改变正则参数Lambda)，预测用户对活动的评分，比较了训练集的预测分值与实际分值的误差平方和。     \n",
    "4.1) 随着迭代次数增大，误差平方和下降，并趋于稳定。对于Lambda = 0.015，当迭代次数为20次时，误差平方和下降为：0.2106。     \n",
    "4.2) 改变正则参数Lambda,在Lambda比较小的时候，误差平方和也比较小；随着Lambda增大，误差平方和迅速增大。当Lambda = 0.0001时，误差平方和最小，为：0.00027。     \n",
    "\n",
    "5) 把前面基于用户、活动、模型的协同过滤，以及生成的各种特征组合在一起，生成新的训练数据，并保存新数据到文件RS_train.csv和RS_test.csv。 \n",
    "\n",
    "结论：       \n",
    "1)  基于用户和基于活动的协同过滤得到的误差平方和相同(相近)。     \n",
    "2) 基于模型的协同过滤比基于用户和活动的协同过滤能更好的预测用户对活动的评分，并且正则参数Lambda对预测分值的影响非常大，在Lambda<1时，预测效果比较好。    \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
